1. 13 Jan, 2012 40 commits
    • ttm/dma: Remove the WARN() which is not useful. · 0e113315
      Konrad Rzeszutek Wilk committed
      It was useful during development, but now on a production system
      we can get this (if the user forgot to upload the firmware):
      
      [drm] radeon: irq initialized.
      [drm] GART: num cpu pages 131072, num gpu pages 131072
      [drm] radeon: ib pool ready.
      [drm] Loading SUMO Microcode
      r600_cp: Failed to load firmware "radeon/SUMO_pfp.bin"
      atl1c 0000:03:00.0: version 1.0.1.0-NAPI
      [drm:evergreen_startup] *ERROR* Failed to load firmware!
      radeon 0000:00:01.0: disabling GPU acceleration
      radeon 0000:00:01.0: ffff8801bb782400 unpin not necessary
      ------------[ cut here ]------------
      WARNING: at /home/konrad/linux-linus/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c:956 ttm_dma_unpopulate+0x79/0x300 [ttm]()
      Hardware name: System Product Name
      Modules linked in: e1000e atl1c radeon(+) ahci libahci libata scsi_mod fbcon tileblit font ttm bitblit softcursor drm_kms_helper wmi xen_blkfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea xenfs xen_privcmd
      Pid: 1600, comm: modprobe Not tainted 3.2.0-06100-ge343a895 #1
      Call Trace:
       [<ffffffff8108973a>] warn_slowpath_common+0x7a/0xb0
       [<ffffffff81089785>] warn_slowpath_null+0x15/0x20
       [<ffffffffa0060309>] ttm_dma_unpopulate+0x79/0x300 [ttm]
       [<ffffffffa01341c0>] radeon_ttm_tt_unpopulate+0x120/0x130 [radeon]
       [<ffffffffa0056e0c>] ttm_tt_destroy+0x2c/0x70 [ttm]
       [<ffffffffa0057a4e>] ttm_bo_cleanup_memtype_use+0x3e/0x80 [ttm]
       [<ffffffffa00595a1>] ttm_bo_release+0x251/0x280 [ttm]
       [<ffffffffa0059610>] ttm_bo_unref+0x40/0x60 [ttm]
       [<ffffffffa0134d02>] radeon_bo_unref+0x42/0x80 [radeon]
       [<ffffffffa0186dfb>] radeon_sa_bo_manager_fini+0x6b/0x80 [radeon]
       [<ffffffffa0146b8f>] radeon_ib_pool_fini+0x6f/0x90 [radeon]
       [<ffffffffa014be49>] r100_ib_fini+0x19/0x20 [radeon]
       [<ffffffffa017b47e>] evergreen_init+0x1ee/0x2d0 [radeon]
      
      The big WARN() has nothing to do with the culprit - which is that
      the firmware was not loaded. So let's remove the WARN() from the TTM DMA code.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • Merge branch 'akpm' (aka "Andrew's patch-bomb, take two") · 09946950
      Linus Torvalds committed
      Andrew explains:
      
       - various misc stuff
      
       - Most of the rest of MM: memcg, threaded hugepages, others.
      
       - cpumask
      
       - kexec
      
       - kdump
      
       - some direct-io performance tweaking
      
       - radix-tree optimisations
      
       - new selftests code
      
         A note on this: often people will develop a new userspace-visible
         feature and will develop userspace code to exercise/test that
         feature.  Then they merge the patch and the selftest code dies.
         Sometimes we paste it into the changelog.  Sometimes the code gets
         thrown into Documentation/(!).
      
         This saddens me.  So this patch creates a bare-bones framework which
         will henceforth allow me to ask people to include their test apps in
         the kernel tree so we can keep them alive.  Then when people enhance
         or fix the feature, I can ask them to update the test app too.
      
         The infrastructure is terribly trivial at present - let's see how it
         evolves.
      
       - checkpoint/restart feature work.
      
         A note on this: this is a project by various mad Russians to perform
         c/r mainly from userspace, with various oddball helper code added
         into the kernel where the need is demonstrated.
      
         So rather than some large central lump of code, what we have is
         little bits and pieces popping up in various places which either
         expose something new or which permit something which is normally
         kernel-private to be modified.
      
         The overall project is an ongoing thing.  I've judged that the size
         and scope of the thing means that we're more likely to be successful
         with it if we integrate the support into mainline piecemeal rather
         than allowing it all to develop out-of-tree.
      
         However I'm less confident than the developers that it will all
         eventually work! So what I'm asking them to do is to wrap each piece
         of new code inside CONFIG_CHECKPOINT_RESTORE.  So if it all
         eventually comes to tears and the project as a whole fails, it should
         be a simple matter to go through and delete all trace of it.
      
      This lot pretty much wraps up the -rc1 merge for me.
      
      * akpm: (96 commits)
        unlzo: fix input buffer free
        ramoops: update parameters only after successful init
        ramoops: fix use of rounddown_pow_of_two()
        c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
        c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
        c/r: introduce CHECKPOINT_RESTORE symbol
        selftests: new x86 breakpoints selftest
        selftests: new very basic kernel selftests directory
        radix_tree: take radix_tree_path off stack
        radix_tree: remove radix_tree_indirect_to_ptr()
        dio: optimize cache misses in the submission path
        vfs: cache request_queue in struct block_device
        fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
        drivers/parport/parport_pc.c: fix warnings
        panic: don't print redundant backtraces on oops
        sysctl: add the kernel.ns_last_pid control
        kdump: add udev events for memory online/offline
        include/linux/crash_dump.h needs elf.h
        kdump: fix crash_kexec()/smp_send_stop() race in panic()
        kdump: crashk_res init check for /sys/kernel/kexec_crash_size
        ...
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 7c17d86a
      Linus Torvalds committed
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
        pptp: Accept packet with seq zero
        RDS: Remove some unused iWARP code
        net: fsl: fec: handle 10Mbps speed in RMII mode
        drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c: add missing iounmap
        drivers/net/ethernet/tundra/tsi108_eth.c: add missing iounmap
        ksz884x: fix mtu for VLAN
        net_sched: sfq: add optional RED on top of SFQ
        dp83640: Fix NOHZ local_softirq_pending 08 warning
        gianfar: Fix invalid TX frames returned on error queue when time stamping
        gianfar: Fix missing sock reference when processing TX time stamps
        phylib: introduce mdiobus_alloc_size()
        net: decrement memcg jump label when limit, not usage, is changed
        net: reintroduce missing rcu_assign_pointer() calls
        inet_diag: Rename inet_diag_req_compat into inet_diag_req
        inet_diag: Rename inet_diag_req into inet_diag_req_v2
        bond_alb: don't disable softirq under bond_alb_xmit
        mac80211: fix rx->key NULL pointer dereference in promiscuous mode
        nl80211: fix old station flags compatibility
        mdio-octeon: use an unique MDIO bus name.
        mdio-gpio: use an unique MDIO bus name.
        ...
    • unlzo: fix input buffer free · 35f15268
      Sascha Hauer committed
      unlzo modifies the pointer to in_buf, so we have to free the original
      buffer, not the modified pointer.
      Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
      Cc: Lasse Collin <lasse.collin@tukaani.org>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
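The bug class described in the unlzo entry can be sketched in user space. This is a minimal illustration, not the unlzo code: `consume()` and `checksum_and_free()` are hypothetical names, and `malloc`/`free` stand in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical consumer: advances the caller's cursor as it reads, the way
 * a decompressor walks its input buffer. Returns the sum of the bytes. */
static unsigned consume(const unsigned char **cursor, size_t len)
{
    unsigned sum = 0;
    while (len--)
        sum += *(*cursor)++;
    return sum;
}

/* Keeps the original allocation in `base` and frees that pointer, not the
 * advanced cursor. Returns the checksum, or -1 if allocation fails. */
static int checksum_and_free(size_t len)
{
    unsigned char *base = malloc(len);
    const unsigned char *cursor = base;   /* cursor moves, base does not */
    unsigned sum;

    if (!base)
        return -1;
    memset(base, 1, len);
    sum = consume(&cursor, len);
    /* Passing `cursor` to free() here would be undefined behaviour, since
     * it no longer points at the start of the allocation. */
    free(base);
    return (int)sum;
}
```

Calling `checksum_and_free(16)` sums sixteen bytes of value 1 and releases the buffer through the saved base pointer.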
    • ramoops: update parameters only after successful init · c755201e
      Kees Cook committed
      If a platform device exists on the system, but ramoops fails to attach to
      it, the module parameters are overridden before ramoops can fall back and
      try to use passed module parameters.  Move update to end of init routine.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Sergiu Iordache <sergiu@chromium.org>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ramoops: fix use of rounddown_pow_of_two() · fdb59507
      Marco Stornelli committed
      The return value of rounddown_pow_of_two wasn't evaluated, so the
      operation was a no-op.
      Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
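The no-op described in the rounddown_pow_of_two() entry is easy to reproduce with any helper that returns its result instead of modifying its argument. The sketch below uses a user-space stand-in, `rounddown_pow2()`, which is an assumption for illustration and not the kernel macro.

```c
#include <assert.h>

/* User-space stand-in for the kernel helper: largest power of two <= n.
 * Assumes n > 0. It returns the result and leaves its argument alone. */
static unsigned long rounddown_pow2(unsigned long n)
{
    unsigned long p = 1;
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Buggy pattern: the return value is discarded, so size is unchanged. */
static unsigned long size_buggy(unsigned long size)
{
    rounddown_pow2(size);
    return size;
}

/* Fixed pattern: the return value is assigned back. */
static unsigned long size_fixed(unsigned long size)
{
    size = rounddown_pow2(size);
    return size;
}
```

With an input of 5000, the buggy variant still returns 5000 while the fixed variant returns 4096.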
    • c/r: prctl: add PR_SET_MM codes to set up mm_struct entries · 028ee4be
      Cyrill Gorcunov committed
      When we restore a task we need to set up text, data and data heap sizes
      from userspace to the values a task had at checkpoint time.  This patch
      adds auxiliary prctl codes for that.
      
      While most of them are statistical in nature (their values feed the
      calculation of /proc/<pid>/statm output), the start_brk and brk values
      are used to compute the allowed size of program data segment expansion,
      so arbitrary changes to these values might be dangerous.  To restrict
      access, the following requirements apply to the prctl calls:
      
       - The process has to have CAP_SYS_ADMIN capability granted.
       - For all opcodes except start_brk/brk members an appropriate
         VMA area must exist and should fit certain VMA flags,
         such as:
         - code segment must be executable but not writable;
         - data segment must not be executable.
      
      start_brk/brk values must not intersect with data segment and must not
      exceed RLIMIT_DATA resource limit.
      
      Still the main guard is CAP_SYS_ADMIN capability check.
      
      Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support
      otherwise these prctl calls will return -EINVAL.
      
      [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
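The two start_brk/brk constraints in the PR_SET_MM entry (no intersection with the data segment, and staying within RLIMIT_DATA) can be sketched as a pure check. This is one plausible reading of the description, not the kernel's code: `struct mm_layout`, `brk_values_ok()`, the "heap starts at or above the end of data" rule, and the "heap plus data size" accounting are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the fields involved; not the kernel's mm_struct. */
struct mm_layout {
    unsigned long start_data, end_data;   /* data segment [start, end) */
    unsigned long start_brk, brk;         /* heap [start_brk, brk) */
};

/* Returns true if the proposed heap bounds pass both checks:
 *  - the heap does not intersect the data segment (assumed here to mean
 *    that it starts at or above end_data);
 *  - heap size plus data size stays within rlimit_data bytes (assumed). */
static bool brk_values_ok(const struct mm_layout *mm, unsigned long rlimit_data)
{
    unsigned long heap, data;

    if (mm->brk < mm->start_brk || mm->end_data < mm->start_data)
        return false;                     /* malformed ranges */
    if (mm->start_brk < mm->end_data)
        return false;                     /* heap overlaps data */

    heap = mm->brk - mm->start_brk;
    data = mm->end_data - mm->start_data;
    return heap <= rlimit_data && data <= rlimit_data - heap;
}
```

The last line compares without adding `heap` and `data` first, so the check cannot wrap around for very large values.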
    • c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4 · b3f7f573
      Cyrill Gorcunov committed
      The mm->start_code/end_code, mm->start_data/end_data and mm->start_brk
      values are used to calculate program text/data segment sizes (which can
      be seen in /proc/<pid>/statm) and the final address of a brk() call.
      
      For restore we need to know all these values.  While
      mm->start_code/end_code are already present in /proc/$pid/stat, the
      remaining members are not, so this patch adds them.
      
      The restore procedure of these members is addressed in another patch using
      prctl().
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • c/r: introduce CHECKPOINT_RESTORE symbol · 067bce1a
      Cyrill Gorcunov committed
      For checkpoint/restore we need auxiliary features compiled into the
      kernel, such as additional prctl codes, /proc/<pid>/map_files, etc.
      At the same time these features are not mandatory for a regular kernel,
      so the CHECKPOINT_RESTORE config symbol provides a way to disable them
      all at once if one wishes to get rid of the additional functionality.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests: new x86 breakpoints selftest · 85bbddc3
      Frederic Weisbecker committed
      Bring a first selftest in the relevant directory.  This tests several
      combinations of breakpoints and watchpoints in x86, as well as icebp traps
      and int3 traps.  Given the amount of breakpoint regressions we raised
      after we merged the generic breakpoint infrastructure, such selftest
      became necessary and can still serve today as a basis for new patches that
      touch the do_debug() path.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests: new very basic kernel selftests directory · 274343ad
      Frederic Weisbecker committed
      Bring a new kernel selftests directory in tools/testing/selftests.  To
      add a new selftest, create a subdirectory with the sources and a
      makefile that creates a target named "run_test" then add the
      subdirectory name to the TARGET var in tools/testing/selftests/Makefile
      and tools/testing/selftests/run_tests script.
      
      This can help centralizing and maintaining any useful selftest that
      developers usually tend to let rust in peace on some random server.
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: take radix_tree_path off stack · e2bdb933
      Hugh Dickins committed
      Down, down in the deepest depths of GFP_NOIO page reclaim, we have
      shrink_page_list() calling __remove_mapping() calling
      __delete_from_swap_cache() or __delete_from_page_cache().
      
      You would not expect those to need much stack, but in fact they call
      radix_tree_delete(): which declares a 192-byte radix_tree_path array on
      its stack (to record the node,offsets it visits when descending, in case
      it needs to ascend to update them).  And if any tag is still set [1],
      that calls radix_tree_tag_clear(), which declares a further such
      192-byte radix_tree_path array on the stack.  (At least we have
      interrupts disabled here, so won't then be pushing registers too.)
      
      That was probably a good choice when most users were 32-bit (array of
      half the size), and adding fields to radix_tree_node would have bloated
      it unnecessarily.  But nowadays many are 64-bit, and each
      radix_tree_node contains a struct rcu_head, which is only used when
      freeing; whereas the radix_tree_path info is only used for updating the
      tree (deleting, clearing tags or setting tags if tagged) when a lock
      must be held, of no interest when accessing the tree locklessly.
      
      So add a parent pointer to the radix_tree_node, in union with the
      rcu_head, and remove all uses of the radix_tree_path.  There would be
      space in that union to save the offset when descending as before (we can
      argue that a lock must already be held to exclude other users), but
      recalculating it when ascending is both easy (a constant shift and a
      constant mask) and uncommon, so it seems better just to do that.
      
      Two little optimizations: no need to decrement height when descending,
      adjusting shift is enough; and once radix_tree_tag_if_tagged() has set
      tag on a node and its ancestors, it need not ascend from that node
      again.
      
      perf on the radix tree test harness reports radix_tree_insert() as 2%
      slower (now having to set parent), but radix_tree_delete() 24% faster.
      Surely that's an exaggeration from rtth's artificially low map shift 3,
      but forcing it back to 6 still rates radix_tree_delete() 8% faster.
      
      [1] Can a pagecache tag (dirty, writeback or towrite) actually still be
      set at the time of radix_tree_delete()? Perhaps not if the filesystem is
      well-behaved.  But although I've not tracked any stack overflow down to
      this cause, I have observed a curious case in which a dirty tag is set
      and left set on tmpfs: page migration's migrate_page_copy() happens to
      use __set_page_dirty_nobuffers() to set PageDirty on the newpage, and
      that sets PAGECACHE_TAG_DIRTY as a side-effect - harmless to a
      filesystem which doesn't use tags, except for this stack depth issue.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nai Xia <nai.xia@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
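The radix_tree entry notes that a slot offset can be recalculated while ascending with "a constant shift and a constant mask". A minimal sketch of that arithmetic follows. `MAP_SHIFT` is set to 6 as mentioned in the commit message, while `slot_offset()` and the level numbering (0 for the lowest level) are assumptions for illustration, not the kernel's identifiers.

```c
#include <assert.h>

#define MAP_SHIFT 6UL                        /* 64 slots per node (assumed) */
#define MAP_MASK  ((1UL << MAP_SHIFT) - 1)

/* Offset of the slot that `index` occupies in the node at `level`,
 * where level 0 is the lowest level of the tree. */
static unsigned long slot_offset(unsigned long index, unsigned level)
{
    return (index >> (level * MAP_SHIFT)) & MAP_MASK;
}
```

For index 0x1234 this gives offset 52 at level 0, 8 at level 1 and 1 at level 2, so nothing has to be recorded on the way down.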
    • radix_tree: remove radix_tree_indirect_to_ptr() · 928da837
      Xiao Guangrong committed
      It is not used anymore, remove it
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dio: optimize cache misses in the submission path · 65dd2aa9
      Andi Kleen committed
      Some investigation of a transaction processing workload showed that a
      major consumer of cycles in __blockdev_direct_IO is the cache miss while
      accessing the block size.  This is because it has to walk the chain from
      block_dev to gendisk to queue.
      
      The block size is needed early on to check alignment and sizes.  It's only
      done if the check for the inode block size fails.  But the costly block
      device state is unconditionally fetched.
      
      - Reorganize the code to only fetch block dev state when actually
        needed.
      
      Then do a prefetch on the block dev early on in the direct IO path.  This
      is worth it, because there is substantial code run before we actually
      touch the block dev now.
      
      - I also added some unlikelies to make it clear to the compiler that the
        block device fetch code is not normally executed.
      
      This gave a small, but measurable improvement on a large database
      benchmark (about 0.3%)
      
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
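The pattern in the dio entry (prefetch the costly state early, read it only on the rare path, and mark that path unlikely) can be sketched with GCC/Clang builtins. The structures and `check_alignment()` below are hypothetical stand-ins, not the direct-IO code; `unlikely()` is defined locally the way the kernel macro is commonly written.

```c
#include <assert.h>

#define unlikely(x) __builtin_expect(!!(x), 0)   /* GCC/Clang builtin */

/* Hypothetical stand-ins for the inode and block-device state. */
struct bdev_state { unsigned blkbits; };
struct file_state { unsigned inode_blkbits; struct bdev_state *bdev; };

/* Returns 0 if offset and length are aligned to the inode block size or,
 * failing that, to the device block size; -1 otherwise. */
static int check_alignment(const struct file_state *f,
                           unsigned long offset, unsigned long len)
{
    unsigned long mask = (1UL << f->inode_blkbits) - 1;

    __builtin_prefetch(f->bdev);   /* start the fetch early; only a hint */

    if (unlikely((offset | len) & mask)) {
        /* Rare path: only now is the device state actually read. */
        mask = (1UL << f->bdev->blkbits) - 1;
        if ((offset | len) & mask)
            return -1;
    }
    return 0;
}
```

Whether such hints pay off depends on the workload; the commit reports about 0.3% on one large database benchmark, and this sketch makes no performance claim of its own.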
    • vfs: cache request_queue in struct block_device · 87192a2a
      Andi Kleen committed
      This makes it possible to get from the inode to the request_queue with one
      less cache miss.  Used in a follow-on optimization.
      
      The lifetime of the pointer is the same as the gendisk.
      
      This assumes that the queue will always stay the same in the gendisk while
      it's visible to block_devices.  I think that's safe, correct?
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/direct-io.c: calculate fs_count correctly in get_more_blocks() · ae55e1aa
      Tao Ma committed
      In get_more_blocks(), we use dio_count to calculate fs_count and do some
      tricky things to increase fs_count if dio_count isn't aligned.  But
      it still has some corner cases that aren't covered.  See the
      following example:
      
      	dio_write foo -s 1024 -w 4096
      
      (direct write 4096 bytes at offset 1024).  The same goes if the offset
      isn't aligned to fs_blocksize.
      
      In this case, the old calculation counts fs_count to be 1, but actually we
      will write into 2 different blocks (if fs_blocksize=4096).  The old code
      just works, since it will call get_block twice (and may have to allocate
      and create extents twice for filesystems like ext4).  So we'd better call
      get_block just once with the proper fs_count.
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
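The corner case in the fs_count entry can be worked through numerically. The sketch below assumes 512-byte dio blocks and 4096-byte filesystem blocks, so the 4096-byte write at offset 1024 covers dio blocks 2 through 9. The function and parameter names are illustrative, and the first function only models a length-based rounding of the kind the message describes.

```c
#include <assert.h>

/* Block numbers are in dio-block units; one fs block spans
 * (1 << blkfactor) dio blocks. The request covers [first_blk, end_blk)
 * and is assumed to be non-empty. */

/* Length-only rounding: ignores where the request starts. */
static unsigned long fs_count_by_length(unsigned long first_blk,
                                        unsigned long end_blk,
                                        unsigned blkfactor)
{
    unsigned long dio_count = end_blk - first_blk;
    unsigned long fs_count = dio_count >> blkfactor;

    if (dio_count & ((1UL << blkfactor) - 1))
        fs_count++;
    return fs_count;
}

/* Span-based count: number of fs blocks touched by the request. */
static unsigned long fs_count_by_span(unsigned long first_blk,
                                      unsigned long end_blk,
                                      unsigned blkfactor)
{
    unsigned long fs_start = first_blk >> blkfactor;
    unsigned long fs_end = (end_blk - 1) >> blkfactor;

    return fs_end - fs_start + 1;
}
```

For the example request (`first_blk = 2`, `end_blk = 10`, `blkfactor = 3`), the length-only version yields 1 while the span-based version yields 2, matching the two filesystem blocks the write actually touches.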
    • drivers/parport/parport_pc.c: fix warnings · 45dac90f
      Andrew Morton committed
      drivers/parport/parport_pc.c: In function '__check_irq':
      drivers/parport/parport_pc.c:3415: warning: return from incompatible pointer type
      drivers/parport/parport_pc.c: In function '__check_dma':
      drivers/parport/parport_pc.c:3417: warning: return from incompatible pointer type
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • panic: don't print redundant backtraces on oops · 6e6f0a1f
      Andi Kleen committed
      When an oops causes a panic and panic prints another backtrace, it's pretty
      common for the original oops data to be scrolled away on an 80x50 screen.
      
      The second backtrace is quite redundant and not needed anyways.
      
      So don't print the panic backtrace when oops_in_progress is true.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: add the kernel.ns_last_pid control · b8f566b0
      Pavel Emelyanov committed
      The sysctl works on the current task's pid namespace, getting and setting
      its last_pid field.
      
      Writing is allowed for CAP_SYS_ADMIN-capable tasks, thus making it possible
      to create a task with a desired pid value.  This ability is badly needed
      for checkpoint/restore in userspace.
      
      This approach suits all the parties for now.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdump: add udev events for memory online/offline · f5138e42
      Michael Holzheu committed
      Currently no udev events for memory hotplug "online" and "offline" are
      generated:
      
        # udevadm monitor
        # echo offline > /sys/devices/system/memory/memory4/state
        ==> No event
      
      When kdump is loaded, kexec detects the current memory configuration and
      stores it in the pre-allocated ELF core header.  Therefore, for kdump it
      is necessary to reload the kdump kernel with kexec when the memory
      configuration changes (e.g.  for online/offline hotplug memory).
      
      In order to do this automatically, udev rules should be used.  This kernel
      patch adds udev events for "online" and "offline".  Together with this
      kernel patch, the following udev rules for online/offline have to be added
      to "/etc/udev/rules.d/98-kexec.rules":
      
        SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/etc/init.d/kdump restart"
        SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/etc/init.d/kdump restart"
      
      [sfr@canb.auug.org.au: fixups for class to subsystem conversion]
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/crash_dump.h needs elf.h · 1f536b9e
      Fabio Estevam committed
      Building an ARM target we get the following warnings:
      
        CC      arch/arm/kernel/setup.o
        In file included from arch/arm/kernel/setup.c:39:
        arch/arm/include/asm/elf.h:102:1: warning: "vmcore_elf64_check_arch" redefined
        In file included from arch/arm/kernel/setup.c:24:
        include/linux/crash_dump.h:30:1: warning: this is the location of the previous definition
      
      Quoting Russell King:
      
      "linux/crash_dump.h makes no attempt to include asm/elf.h, but it depends
      on stuff in asm/elf.h to determine how stuff inside this file is defined
      at parse time.
      
      So, if asm/elf.h is included after linux/crash_dump.h or not at all, you
      get a different result from the situation where asm/elf.h is included
      before."
      
      So add elf.h header to crash_dump.h to avoid this problem.
      
      The original discussion about this can be found at:
      http://www.spinics.net/lists/arm-kernel/msg154113.html
      Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[3.2.1]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdump: fix crash_kexec()/smp_send_stop() race in panic() · 93e13a36
      Michael Holzheu committed
      When two CPUs call panic at the same time there is a possible race
      condition that can stop kdump.  The first CPU calls crash_kexec() and the
      second CPU calls smp_send_stop() in panic() before crash_kexec() finished
      on the first CPU.  So the second CPU stops the first CPU and therefore
      kdump fails:
      
      1st CPU:
        panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump
      
      2nd CPU:
        panic()->crash_kexec()->kexec_mutex already held by 1st CPU
             ->smp_send_stop()-> stop 1st CPU (stop kdump)
      
      This patch fixes the problem by introducing a spinlock in panic that
      allows only one CPU to process crash_kexec() and the subsequent panic
      code.
      
      All other CPUs call the weak function panic_smp_self_stop(), which stops the
      calling CPU.  This function can be overridden by architecture code.  For
      example, "tile" can use its lower-power "nap" instruction for that.
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
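The "only one CPU proceeds" idea from the panic() race entry can be illustrated in user space with a C11 atomic flag. This is an analogue under stated assumptions, not the kernel implementation, which uses a spinlock; `panic_claimed` and `claim_panic_path()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Set by the first caller and never cleared. */
static atomic_flag panic_claimed = ATOMIC_FLAG_INIT;

/* Returns true for exactly one caller: the first one to set the flag.
 * Every later caller sees the flag already set and gets false. */
static bool claim_panic_path(void)
{
    return !atomic_flag_test_and_set(&panic_claimed);
}
```

In the scenario the commit describes, the caller that gets `true` would go on to the crash_kexec() step, and callers that get `false` would stop themselves instead of stopping the winner.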
    • kdump: crashk_res init check for /sys/kernel/kexec_crash_size · bec013c4
      Michael Holzheu committed
      Currently it is possible to set the crash_size via the sysfs
      /sys/kernel/kexec_crash_size even if no crash kernel memory has been
      defined with the "crashkernel" parameter.  In this case "crashk_res" is
      not initialized and crashk_res.start = crashk_res.end = 0.  Unfortunately
      resource_size(&crashk_res) returns 1 in this case.  This breaks the s390
      implementation of crash_(un)map_reserved_pages().
      
      To fix the problem the correct "old_size" is now calculated in
      crash_shrink_memory().  "old_size" is set to "0" if crashk_res is not
      initialized.  With this change crash_shrink_memory() will do nothing, when
      "crashk_res" is not initialized.  It will return "0" for "echo 0 >
      /sys/kernel/kexec_crash_size" and -EINVAL for "echo [not zero] >
      /sys/kernel/kexec_crash_size".
      
      In addition to that this patch also simplifies the "ret = -EINVAL" vs.
      "ret = 0" logic as suggested by Simon Horman.
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Reviewed-by: Dave Young <dyoung@redhat.com>
      Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
      Reviewed-by: Simon Horman <horms@verge.net.au>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
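The off-by-one in the crashk_res entry comes from inclusive bounds: a range with start = end = 0 still has size 1. A minimal sketch follows, with `struct range`, `range_size()` and `old_size_of()` as hypothetical user-space stand-ins, and with "end == 0" assumed to be the signal for an uninitialized resource.

```c
#include <assert.h>

struct range { unsigned long start, end; };   /* inclusive bounds */

/* Size of an inclusive range: an all-zero range still reports 1. */
static unsigned long range_size(const struct range *r)
{
    return r->end - r->start + 1;
}

/* Treats a range whose end is 0 as "not initialized" (an assumption made
 * for this sketch) and reports size 0 for it. */
static unsigned long old_size_of(const struct range *r)
{
    return (r->end == 0) ? 0 : range_size(r);
}
```

With this guard, a request to shrink an uninitialized region to 0 compares 0 against 0 and has nothing to do.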
    • kdump: add missing RAM resource in crash_shrink_memory() · 6480e5a0
      Michael Holzheu committed
      When shrinking crashkernel memory using /sys/kernel/kexec_crash_size, no
      RAM resource is currently created for the released memory.
      
      Example:
      
        $ cat /proc/iomem
        00000000-bfffffff : System RAM
          00000000-005b7ac3 : Kernel code
          005b7ac4-009743bf : Kernel data
          009bb000-00a85c33 : Kernel bss
        c0000000-cfffffff : Crash kernel
        d0000000-ffffffff : System RAM
      
        $ echo 0 > /sys/kernel/kexec_crash_size
        $ cat /proc/iomem
        00000000-bfffffff : System RAM
          00000000-005b7ac3 : Kernel code
          005b7ac4-009743bf : Kernel data
          009bb000-00a85c33 : Kernel bss
                                         <<-- System RAM is missing here
        d0000000-ffffffff : System RAM
      
      One result of this bug is that the memory chunk can never be set offline
      using memory hotplug.  With this patch I insert a new "System RAM"
      resource for the released memory.  The example above then looks as
      follows:
      
        $ echo 0 > /sys/kernel/kexec_crash_size
        $ cat /proc/iomem
        00000000-bfffffff : System RAM
          00000000-005b7ac3 : Kernel code
          005b7ac4-009743bf : Kernel data
          009bb000-00a85c33 : Kernel bss
        c0000000-cfffffff : System RAM   <<-- new resource
        d0000000-ffffffff : System RAM
      
      And now I can set chunk c0000000-cfffffff offline.
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6480e5a0
    • W
      kexec: remove KMSG_DUMP_KEXEC · a3dd3323
      WANG Cong committed
      KMSG_DUMP_KEXEC is useless because we already save kernel messages inside
      /proc/vmcore, and it is unsafe to allow modules to do other things in a
      crash dump scenario.
      
      [akpm@linux-foundation.org: fix powerpc build]
      Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
      Reported-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jarod Wilson <jarod@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3dd3323
    • W
      cpumask: update setup_node_to_cpumask_map() comments · 9512938b
      Wanlong Gao committed
      node_to_cpumask() has been replaced by cpumask_of_node(), and wholly
      removed since commit 29c337a0 ("cpumask: remove obsolete node_to_cpumask
      now everyone uses cpumask_of_node").
      
      So update the comments for setup_node_to_cpumask_map().
      Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9512938b
    • K
      mm/vmalloc.c: eliminate extra loop in pcpu_get_vm_areas error path · f1db7afd
      Kautuk Consul committed
      If either the vas or the vms array cannot be allocated with kzalloc(),
      the code jumps to the err_free label.
      
      The err_free label runs a loop that checks and frees each member of the
      vas and vms arrays.  That loop is not needed in this situation, because
      none of the array members have been allocated at this point.
      
      Eliminate the extra loop we have to go through by introducing a new label
      err_free2 and then jumping to it.
      
      [akpm@linux-foundation.org: remove now-unneeded tests]
      Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1db7afd
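The two-label error path described in this commit can be illustrated with a small userspace sketch. It is not the pcpu_get_vm_areas() code: `struct item`, `alloc_sets()` and `free_sets()` are hypothetical, and calloc()/free() stand in for the kernel allocators. The point is only that a failure before any member exists can skip the per-member cleanup loop.

```c
#include <stdlib.h>

/* Hypothetical element type standing in for the per-area objects. */
struct item {
    int dummy;
};

/* Allocate two pointer arrays of n members each. Returns 0 on success. */
static int alloc_sets(size_t n, struct item ***out_a, struct item ***out_b)
{
    struct item **a, **b;
    size_t i;

    a = calloc(n, sizeof(*a));
    b = calloc(n, sizeof(*b));
    if (!a || !b)
        goto err_free2; /* no members allocated yet: skip the loop */

    for (i = 0; i < n; i++) {
        a[i] = calloc(1, sizeof(**a));
        b[i] = calloc(1, sizeof(**b));
        if (!a[i] || !b[i])
            goto err_free;
    }

    *out_a = a;
    *out_b = b;
    return 0;

err_free:
    /* Both arrays exist here, so no NULL test on a or b is needed.
     * Unfilled slots are NULL from calloc(), and free(NULL) is a no-op. */
    for (i = 0; i < n; i++) {
        free(a[i]);
        free(b[i]);
    }
err_free2:
    free(a);
    free(b);
    return -1;
}

/* Release everything handed out by a successful alloc_sets() call. */
static void free_sets(size_t n, struct item **a, struct item **b)
{
    size_t i;

    for (i = 0; i < n; i++) {
        free(a[i]);
        free(b[i]);
    }
    free(a);
    free(b);
}
```

Because err_free is only reachable once both arrays exist, the loop no longer has to test the arrays themselves, which is presumably what the "remove now-unneeded tests" note refers to.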
    • H
      mm: rearrange putback_inactive_pages · 3f79768f
      Hugh Dickins committed
      There is sometimes confusion between the global putback_lru_pages() in
      migrate.c and the static putback_lru_pages() in vmscan.c: rename the
      latter putback_inactive_pages(): it helps shrink_inactive_list() rather as
      move_active_pages_to_lru() helps shrink_active_list().
      
      Remove unused scan_control arg from putback_inactive_pages() and from
      update_isolated_counts().  Move clear_active_flags() inside
      update_isolated_counts().  Move NR_ISOLATED accounting up into
      shrink_inactive_list() itself, so the balance is clearer.
      
      Do the spin_lock_irq() before calling putback_inactive_pages() and
      spin_unlock_irq() after return from it, so that it better matches
      update_isolated_counts() and move_active_pages_to_lru().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f79768f
    • H
      mm: remove isolate_pages() · f626012d
      Hugh Dickins committed
      The isolate_pages() level in vmscan.c offers little but indirection: merge
      it into isolate_lru_pages() as the compiler does, and use the names
      nr_to_scan and nr_scanned in each case.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f626012d
    • H
      mm: remove del_page_from_lru, add page_off_lru · 1c1c53d4
      Hugh Dickins committed
      del_page_from_lru() repeats del_page_from_lru_list(), also working out
      which LRU the page was on, clearing the relevant bits.  Decouple those
      functions: remove del_page_from_lru() and add page_off_lru().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c1c53d4
    • H
      mm: enum lru_list lru · 4111304d
      Hugh Dickins committed
      Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4111304d
    • H
      mm: no blank line after EXPORT_SYMBOL in swap.c · 4d06f382
      Hugh Dickins committed
      checkpatch rightly protests
      
        WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
      
      so fix the five offenders in mm/swap.c.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d06f382
    • H
      mm: fewer underscores in ____pagevec_lru_add · 5095ae83
      Hugh Dickins committed
      What's so special about ____pagevec_lru_add() that it needs four leading
      underscores?  Nothing, it just helped to distinguish from
      __pagevec_lru_add() in 2.6.28 development.  Cut two leading underscores.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5095ae83
    • H
      mm: take pagevecs off reclaim stack · 2bcf8879
      Hugh Dickins committed
      Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
      by lists of pages_to_free: then apply Konstantin Khlebnikov's
      free_hot_cold_page_list() to them instead of pagevec_release().
      
      Which simplifies the flow (no need to drop and retake lock whenever
      pagevec fills up) and reduces stale addresses in stack backtraces
      (which often showed through the pagevecs); but more importantly,
      removes another 120 bytes from the deepest stacks in page reclaim.
      Although I've not recently seen an actual stack overflow here with
      a vanilla kernel, move_active_pages_to_lru() has often featured in
      deep backtraces.
      
      However, free_hot_cold_page_list() does not handle compound pages
      (nor need it: a Transparent HugePage would have been split by the
      time it reaches the call in shrink_page_list()), but it is possible
      for putback_lru_pages() or move_active_pages_to_lru() to be left
      holding the last reference on a THP, so must exclude the unlikely
      compound case before putting on pages_to_free.
      
      Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
      The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
      but that is never on the reclaim path, and cannot be replaced by a list.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bcf8879
    • H
      memcg: fix mem_cgroup_print_bad_page · 90b3feae
      Hugh Dickins committed
      If DEBUG_VM, mem_cgroup_print_bad_page() is called whenever bad_page()
      shows a "Bad page state" message, removes page from circulation, adds a
      taint and continues.  This is at a very low level, often when a spinlock
      is held (sometimes when page table lock is held, for example).
      
      We want to recover from this badness, not make it worse: we must not
      kmalloc memory here, we must not do a cgroup path lookup via dubious
      pointers.  No doubt that code was useful to debug a particular case at one
      time, and may be again, but take it out of the mainline kernel.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90b3feae
    • H
      memcg: fix split_huge_page_refcounts() · 12d27107
      Hugh Dickins committed
      This patch started off as a cleanup: __split_huge_page_refcounts() has to
      cope with two scenarios, when the hugepage being split is already on LRU,
      and when it is not; but why does it have to split that accounting across
      three different sites?  Consolidate it in lru_add_page_tail(), handling
      evictable and unevictable alike, and use standard add_page_to_lru_list()
      when accounting is needed (when the head is not yet on LRU).
      
      But a recent regression in -next, I guess the removal of PageCgroupAcctLRU
      test from mem_cgroup_split_huge_fixup(), makes this now a necessary fix:
      under load, the MEM_CGROUP_ZSTAT count was wrapping to a huge number,
      messing up reclaim calculations and causing a freeze at rmdir of cgroup.
      
      Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that
      count - this has not been the only such incident.  Document that
      lru_add_page_tail() is for Transparent HugePages by #ifdef around it.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12d27107
    • M
      mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone · 0cee34fd
      Mel Gorman committed
      If compaction can proceed for a given zone, shrink_zones() does not
      reclaim any more pages from it.  After commit [e0c23279: vmscan: abort
      reclaim/compaction if compaction can proceed], do_try_to_free_pages()
      tries to finish as soon as possible once one zone can compact.
      
      This was intended to prevent slabs being shrunk unnecessarily but there
      are side-effects.  One is that a small zone that is ready for compaction
      will abort reclaim even if the chances of successfully allocating a THP
      from that zone are small.  It also means that reclaim can return too early
      even though sc->nr_to_reclaim pages were not reclaimed.
      
      This partially reverts the commit until it is proven that slabs are really
      being shrunk unnecessarily but preserves the check to return 1 to avoid
      OOM if reclaim was aborted prematurely.
      
      [aarcange@redhat.com: This patch replaces a revert from Andrea]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cee34fd
    • M
      mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available · fe4b1b24
      Mel Gorman committed
      In commit e0887c19 ("vmscan: limit direct reclaim for higher order
      allocations"), Rik noted that reclaim was too aggressive when THP was
      enabled.  In his initial patch he used the number of free pages to decide
      if reclaim should abort for compaction.  My feedback was that reclaim and
      compaction should be using the same logic when deciding if reclaim should
      be aborted.
      
      Unfortunately, this had the effect of reducing THP success rates when the
      workload included something like streaming reads that continually
      allocated pages.  The window during which compaction could run and return
      a THP was too small.
      
      This patch combines Rik's two patches.  compaction_suitable() is still
      used to decide whether reclaim should be aborted to allow compaction to
      run.  However, it will also ensure that there is a reasonable buffer of
      free pages available.  This improves THP allocation success rates while
      bounding the number of pages that are freed for compaction.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe4b1b24
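The combined check described in this commit (a free-page buffer plus the compaction_suitable() verdict) might be sketched as below. This is an illustration under stated assumptions, not the vmscan.c code: `struct zone_sketch` and `reclaim_can_stop()` are hypothetical, the suitability result is passed in as a plain flag, and the buffer size of `2UL << order` pages above the high watermark is an assumed value chosen for the example.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced view of a memory zone. */
struct zone_sketch {
    unsigned long free_pages;
    unsigned long high_wmark;
    bool compaction_suitable; /* stand-in for the compaction_suitable() result */
};

/* Reclaim may stop in favor of compaction only when both conditions hold:
 * enough free pages are buffered, and compaction is considered suitable. */
static bool reclaim_can_stop(const struct zone_sketch *z, unsigned int order)
{
    unsigned long buffer = 2UL << order; /* assumed buffer size */

    if (z->free_pages < z->high_wmark + buffer)
        return false; /* keep reclaiming until the buffer exists */

    return z->compaction_suitable;
}
```

Requiring the buffer first keeps the window in which compaction can succeed open for longer, while the fixed bound limits how many pages reclaim frees on compaction's behalf.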
    • M
      mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman committed
      This patch adds a lightweight sync migration mode, MIGRATE_SYNC_LIGHT,
      that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6bc32b8
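The mapping between callers and migration modes in this commit can be summarized in a few lines of C. The three mode names come from the commit message; everything else in this sketch is hypothetical, including the helpers `compaction_migrate_mode()`, `hotplug_migrate_mode()` and `mode_allows_writeback()`, which only restate the rules given in the text.

```c
#include <stdbool.h>

/* Mode names as given in the commit message. */
enum migrate_mode {
    MIGRATE_ASYNC,
    MIGRATE_SYNC_LIGHT,
    MIGRATE_SYNC
};

/* Async compaction stays async; sync compaction uses the light mode. */
static enum migrate_mode compaction_migrate_mode(bool sync_compaction)
{
    return sync_compaction ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC;
}

/* Other migrate_pages users, such as memory hotplug, use the full mode. */
static enum migrate_mode hotplug_migrate_mode(void)
{
    return MIGRATE_SYNC;
}

/* Only the full sync mode may write pages back to backing storage. */
static bool mode_allows_writeback(enum migrate_mode mode)
{
    return mode == MIGRATE_SYNC;
}
```

Under these rules compaction never reaches a mode that permits writeback, which is how the commit avoids the long stalls it describes for slow devices such as USB sticks.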
    • M
      mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred · 66199712
      Mel Gorman committed
      If compaction is deferred, direct reclaim is used to try to free enough
      pages for the allocation to succeed.  For small high-orders, this has a
      reasonable chance of success.  However, if the caller has specified
      __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
      to fail the allocation rather than stall the caller in direct reclaim.
      This patch skips direct reclaim if compaction is deferred and the caller
      specifies __GFP_NO_KSWAPD.
      
      Async compaction only considers a subset of pages so it is possible for
      compaction to be deferred prematurely and not enter direct reclaim even in
      cases where it should.  To compensate for this, this patch also defers
      compaction only if sync compaction failed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66199712
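The two decisions described in this commit can be written as small predicates. This is a sketch of the policy only, not the page allocator code: `SK_GFP_NO_KSWAPD` is a hypothetical bit standing in for __GFP_NO_KSWAPD (its value here is not the kernel's), and `should_try_direct_reclaim()` and `should_defer_compaction()` are names invented for the example.

```c
#include <stdbool.h>

/* Hypothetical flag bit standing in for __GFP_NO_KSWAPD. */
#define SK_GFP_NO_KSWAPD 0x1u

/* Skip direct reclaim when compaction is deferred and the caller asked
 * to limit disruption; the allocation then fails instead of stalling. */
static bool should_try_direct_reclaim(unsigned int gfp_mask,
                                      bool compaction_deferred)
{
    if (compaction_deferred && (gfp_mask & SK_GFP_NO_KSWAPD))
        return false;

    return true;
}

/* Defer future compaction only after a failed sync attempt, since async
 * compaction looks at a subset of pages and may fail prematurely. */
static bool should_defer_compaction(bool sync_compaction, bool compaction_failed)
{
    return sync_compaction && compaction_failed;
}
```

Callers that do not set the flag still enter direct reclaim while compaction is deferred, so ordinary high-order allocations keep the behavior described at the start of the commit message.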