1. 31 Jul, 2013 3 commits
  2. 04 Jul, 2013 37 commits
    • P
      nbd: correct disconnect behavior · c378f70a
      Paul Clements committed
      Currently, when a disconnect is requested by the user (via NBD_DISCONNECT
      ioctl) the return from NBD_DO_IT is undefined (it is usually one of
      several error codes).  This means that nbd-client does not know if a
      manual disconnect was performed or whether a network error occurred.
      Because of this, nbd-client's persist mode (which tries to reconnect after
      error, but not after manual disconnect) does not always work correctly.
      
      This change fixes this by causing NBD_DO_IT to always return 0 if a user
      requests a disconnect.  This means that nbd-client can correctly either
      persist the connection (if an error occurred) or disconnect (if the user
      requested it).
      Signed-off-by: Paul Clements <paul.clements@steeleye.com>
      Acked-by: Rob Landley <rob@landley.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c378f70a
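The client-side decision this change makes possible can be sketched as below. This is a minimal illustration, not code from nbd-client: the function name `nbd_should_reconnect` and the `persist` flag are hypothetical, and the only behavior taken from the commit message is that NBD_DO_IT returns 0 after a user-requested disconnect.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not taken from nbd-client): decide whether a
 * client should reconnect after the NBD_DO_IT ioctl returns.
 *
 * do_it_ret: value returned by the NBD_DO_IT ioctl
 * persist:   true if the client runs in persist mode
 */
bool nbd_should_reconnect(int do_it_ret, bool persist)
{
	/* With this change, 0 means the user requested the disconnect. */
	if (do_it_ret == 0)
		return false;

	/* Anything else is treated as an error; reconnect only when persisting. */
	return persist;
}
```

A persist-mode loop would call this after each return from NBD_DO_IT and stop as soon as it yields false.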
    • A
      rapidio: change endpoint device name format · 6ca40c25
      Alexandre Bounine committed
      Change endpoint device name format to use a component tag value instead of
      device destination ID.
      
      RapidIO specification defines a component tag to be a unique identifier
      for devices in a network.  RapidIO switches already use component tag as
      part of their device name and also use it for device identification when
      processing error management event notifications.
      
      Forming an endpoint's device name from its component tag instead of its
      destination ID keeps sysfs device directories unchanged if a routing
      process dynamically changes the endpoint's destination ID as a result of
      route optimization.
      
      This change should not affect any existing users because a valid device
      destination ID should always be obtained by reading the "destid"
      attribute, not by parsing the device name.
      
      This patch also removes the switchid member from struct rio_switch because
      it simply duplicates the component tag and has no use other than in
      device name generation.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Andre van Herk <andre.van.herk@Prodrive.nl>
      Cc: Micha Nelissen <micha.nelissen@Prodrive.nl>
      Cc: Stef van Os <stef.van.os@Prodrive.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ca40c25
    • A
      rapidio: add udev notification · 3bdbb62f
      Alexandre Bounine committed
      Add RapidIO-specific modalias generation to enable udev notifications
      about RapidIO-specific events.
      
      The RapidIO modalias string format is shown below:
      
      "rapidio:vNNNNdNNNNavNNNNadNNNN"
      
      Where:
      v  - Device Vendor ID (16 bit),
      d  - Device ID (16 bit),
      av - Assembly Vendor ID (16 bit),
      ad - Assembly ID (16 bit),
      
      as they are reported in corresponding Capability Registers (CARs)
      of each RapidIO device.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Andre van Herk <andre.van.herk@Prodrive.nl>
      Cc: Micha Nelissen <micha.nelissen@Prodrive.nl>
      Cc: Stef van Os <stef.van.os@Prodrive.nl>
      Cc: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bdbb62f
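The modalias layout described in this commit can be illustrated with a small formatter. This is a sketch under assumptions: the function name `rio_format_modalias` is hypothetical, and rendering each NNNN field as four upper-case hexadecimal digits is an assumption about the format rather than something the commit message states.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: build a string in the
 * "rapidio:vNNNNdNNNNavNNNNadNNNN" layout.
 * Assumption: each NNNN is four upper-case hex digits.
 * Returns the snprintf() result (length that was, or would be, written).
 */
int rio_format_modalias(char *buf, size_t len,
			uint16_t vid, uint16_t did,
			uint16_t asm_vid, uint16_t asm_did)
{
	return snprintf(buf, len, "rapidio:v%04Xd%04Xav%04Xad%04X",
			(unsigned int)vid, (unsigned int)did,
			(unsigned int)asm_vid, (unsigned int)asm_did);
}
```

A udev rule or module alias table would then match on a prefix of this string, for example on the vendor and device fields only.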
    • A
      rapidio: update enumerator registration mechanism · 9edbc30b
      Alexandre Bounine committed
      Update enumeration/discovery method registration mechanism to allow
      loading enumeration/discovery methods before all mports are registered.
      
      The existing statically linked RapidIO subsystem expects all available
      RapidIO mport devices to be initialized and registered before the
      enumeration/discovery method is registered.  Switching to loadable mport
      device drivers creates a situation where an mport device driver can be
      loaded after the enumeration/discovery method is attached (e.g., a loadable
      mport driver in a system with a statically linked RapidIO core and
      enumerator).  This will also happen in a system with hot-pluggable RapidIO
      controllers.
      
      To remove the dependency on the initialization/registration order, this
      patch introduces an enumeration/discovery registration mechanism that
      supports arbitrary registration order of mports and enumerator/discovery
      methods.
      
      The following registration rules are implemented:
      - only one enumeration/discovery method can be registered for a given mport
        ID (including RIO_MPORT_ANY);
      - when a new enumeration/discovery method tries to attach to a registered
        mport device, a method with a matching mport ID replaces the default method
        previously registered for that mport (if any);
      - an enumeration/discovery method with target ID=RIO_MPORT_ANY is attached
        only to mports that do not have another enumerator attached to them;
      - when a new mport device is registered with the RapidIO subsystem, the
        registration routine searches for the enumeration/discovery method with
        the best matching mport ID.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Andre van Herk <andre.van.herk@Prodrive.nl>
      Cc: Micha Nelissen <micha.nelissen@Prodrive.nl>
      Cc: Stef van Os <stef.van.os@Prodrive.nl>
      Cc: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9edbc30b
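The "best matching mport ID" rule from this commit can be sketched as a lookup that prefers an exact match and falls back to the wildcard entry. Everything here is hypothetical: `struct scan_method`, `find_scan_method`, and the value chosen for `MPORT_ANY` are stand-ins for illustration, not the kernel's data structures.

```c
#include <stddef.h>

#define MPORT_ANY (-1)	/* stand-in for RIO_MPORT_ANY; the real value is not given here */

/* Hypothetical record of a registered enumeration/discovery method. */
struct scan_method {
	int mport_id;		/* target mport ID, or MPORT_ANY */
	const char *name;
};

/*
 * Pick the method for a newly registered mport: an exact mport ID
 * match wins; otherwise fall back to a method registered for
 * MPORT_ANY; otherwise return NULL.
 */
const struct scan_method *find_scan_method(const struct scan_method *methods,
					   size_t count, int mport_id)
{
	const struct scan_method *fallback = NULL;

	for (size_t i = 0; i < count; i++) {
		if (methods[i].mport_id == mport_id)
			return &methods[i];
		if (methods[i].mport_id == MPORT_ANY)
			fallback = &methods[i];
	}
	return fallback;
}
```

Because the lookup runs at mport registration time, it gives the same result whether the methods or the mports were registered first, which is the ordering independence the commit is after.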
    • A
      rapidio: convert switch drivers to modules · 2ec3ba69
      Alexandre Bounine committed
      Rework RapidIO switch drivers to add an option to build them as loadable
      kernel modules.
      
      This patch removes RapidIO-specific vmlinux section and converts switch
      drivers to be compatible with LDM driver registration method.  To simplify
      registration of device-specific callback routines this patch introduces
      rio_switch_ops data structure.  The sw_sysfs() callback is removed from
      the list of device-specific operations because under the new structure its
      functions can be handled by switch driver's probe() and remove() routines.
      
      If a specific switch device driver is not loaded the RapidIO subsystem
      core will use default standard-based operations to configure a switch.
      Because the current implementation of RapidIO enumeration/discovery method
      relies on availability of device-specific operations for error management,
      switch device drivers must be loaded before the RapidIO
      enumeration/discovery starts.
      
      This patch also moves several common routines from enumeration/discovery
      module into the RapidIO core code to make switch-specific operations
      accessible to all components of RapidIO subsystem.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Andre van Herk <andre.van.herk@Prodrive.nl>
      Cc: Micha Nelissen <micha.nelissen@Prodrive.nl>
      Cc: Stef van Os <stef.van.os@Prodrive.nl>
      Cc: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ec3ba69
    • O
      kernel/fork.c:copy_process(): don't add the uninitialized child to thread/task/pid lists · 81907739
      Oleg Nesterov committed
      copy_process() adds the new child to thread_group/init_task.tasks list and
      then does attach_pid(child, PIDTYPE_PID).  This means that the lockless
      next_thread() or next_task() can see this thread with the wrong pid.  Say,
      "ls /proc/pid/task" can list the same inode twice.
      
      We could move attach_pid(child, PIDTYPE_PID) up, but in this case
      find_task_by_vpid() can find the new thread before it was fully
      initialized.
      
      And this is already true for PIDTYPE_PGID/PIDTYPE_SID.  With this patch,
      copy_process() initializes child->pids[*].pid first, then calls
      attach_pid() to insert the task into the pid->tasks list.
      
      attach_pid() no longer needs the "struct pid*" argument; it is always
      called after pid_link->pid has already been set.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81907739
    • O
      exit.c: unexport __set_special_pids() · 81dabb46
      Oleg Nesterov committed
      Move __set_special_pids() from exit.c to sys.c, close to its single caller,
      and make it static.
      
      Also rename it to set_special_pids(); the other helper with this name has
      gone away.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81dabb46
    • A
      ptrace: add ability to get/set signal-blocked mask · 29000cae
      Andrey Vagin committed
      crtools uses parasite code for dumping processes.  The parasite code is
      injected into a process with the help of PTRACE_SEIZE.
      
      Currently crtools blocks signals from the parasite code.  If a process has
      pending signals, crtools waits while the process handles these signals.
      
      This method is not suitable for stopped tasks.  A stopped task can have a
      few pending signals.  When we try to execute the parasite code, we need to
      drop SIGSTOP, but all other signals must remain pending, because the state
      of processes must not be changed during checkpointing.
      
      This patch adds two ptrace commands to set/get the signal-blocked mask.
      
      I think gdb can use these commands too.
      
      [akpm@linux-foundation.org: be consistent with brace layout]
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29000cae
    • J
      lcd: add devm_lcd_device_{register,unregister}() · 1d0c48e6
      Jingoo Han committed
      These functions allow the driver core to automatically clean up any
      allocation made by lcd drivers.  Thus it simplifies the error paths.
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d0c48e6
    • J
      backlight: add devm_backlight_device_{register,unregister}() · 8318fde4
      Jingoo Han committed
      These functions allow the driver core to automatically clean up any
      allocation made by backlight drivers.  Thus it simplifies the error
      paths.
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8318fde4
    • B
      drivers/dma: remove unused support for MEMSET operations · 48a9db46
      Bartlomiej Zolnierkiewicz committed
      There have never been any real users of MEMSET operations since they
      were introduced in January 2007 by commit 7405f74b ("dmaengine:
      refactor dmaengine around dma_async_tx_descriptor").  Therefore remove
      support for them for now; it can always be brought back when needed.
      
      [sebastian.hesselbarth@gmail.com: fix drivers/dma/mv_xor]
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Acked-by: Dan Williams <djbw@fb.com>
      Cc: Tomasz Figa <t.figa@samsung.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Kevin Hilman <khilman@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48a9db46
    • J
      dmi: add support for exact DMI matches in addition to substring matching · 5017b285
      Jani Nikula committed
      dmi_match() considers a substring match to be a successful match.  This is
      not always sufficient to distinguish between DMI data for different
      systems.  Add support for exact string matching using strcmp() in addition
      to the substring matching using strstr().
      
      The specific use case in the i915 driver is to allow us to use an exact
      match for D510MO, without also incorrectly matching D510MOV:
      
        {
      	.ident = "Intel D510MO",
      	.matches = {
      		DMI_MATCH(DMI_BOARD_VENDOR, "Intel"),
      		DMI_EXACT_MATCH(DMI_BOARD_NAME, "D510MO"),
      	},
        }
      Signed-off-by: Jani Nikula <jani.nikula@intel.com>
      Cc: <annndddrr@gmail.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Cornel Panceac <cpanceac@gmail.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5017b285
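The difference between the two matching modes in this commit comes down to strstr() versus strcmp(). The sketch below shows it in isolation; `field_matches` is a hypothetical helper for illustration and is not the kernel's DMI matching code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Illustrative matcher, not the kernel's DMI code: compare one field
 * against a pattern, either exactly (strcmp) or as a substring (strstr).
 */
bool field_matches(const char *field, const char *pattern, bool exact)
{
	if (exact)
		return strcmp(field, pattern) == 0;
	return strstr(field, pattern) != NULL;
}
```

With substring matching, the pattern "D510MO" also matches a board named "D510MOV"; with exact matching it does not, which is the case the i915 quirk needs.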
    • K
      drivers: avoid format strings in names passed to alloc_workqueue() · d8537548
      Kees Cook committed
      For the workqueue creation interfaces that do not expect format strings,
      make sure they cannot accidentally be parsed that way.  Additionally, clean
      up calls made with a single parameter that would be handled as a format
      string.  Many callers are passing potentially dynamic string content, so
      use "%s" in those cases to avoid any potential accidents.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8537548
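The "%s" pattern from this commit can be shown with a user-space stand-in for an interface whose name argument is parsed as a format string. Both `make_name` and `make_name_from_string` are hypothetical names used only for this sketch; they do not correspond to kernel functions.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * User-space stand-in for a printf-style naming interface: the third
 * argument is always parsed as a format string.
 */
int make_name(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return n;
}

/* Safe pattern: dynamic content is passed as an argument, never as the format. */
int make_name_from_string(char *buf, size_t len, const char *dynamic_name)
{
	return make_name(buf, len, "%s", dynamic_name);
}
```

Calling `make_name(buf, len, dynamic_name)` directly would interpret any '%' in the dynamic string as a conversion with no matching argument, which is undefined behavior; routing it through "%s" copies the string verbatim.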
    • D
      err.h: IS_ERR() can accept __user pointers · e7152b97
      Dan Carpenter committed
      Sparse generates a false positive when you pass a __user or __iomem
      pointer to the IS_ERR() functions.
      
        drivers/rtc/rtc-ds1286.c:344:36: sparse: incorrect type in argument 1 (different address spaces)
        drivers/rtc/rtc-ds1286.c:344:36:    expected void const *ptr
        drivers/rtc/rtc-ds1286.c:344:36:    got unsigned int [noderef] [usertype] <asn:2>*rtcregs
      
      We can silence these by adding a __force here and upgrading to Sparse
      v0.4.5-rc1 or later.
      
      This change has no effect when using current Sparse releases.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Christopher Li <sparse@chrisli.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7152b97
    • C
      sparsemem: add BUILD_BUG_ON when sizeof mem_section is non-power-of-2 · 55878e88
      Cody P Schafer committed
      Instead of leaving a hidden trap for the next person who comes along and
      wants to add something to mem_section, add a big fat warning about it
      needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init()
      to catch mistakes.
      
      Right now non-power-of-2 mem_sections cause a number of WARNs at boot
      (which don't clearly point to the size of mem_section as an issue), but
      the system limps on (temporarily, at least).
      
      This is based upon Dave Hansen's earlier RFC where he ran into the same
      issue:
      	"sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2"
      	http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55878e88
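A compile-time size check of the kind this commit adds can be sketched in portable C11 with `_Static_assert`, which plays the role that BUILD_BUG_ON() plays in the kernel. The struct below is a made-up stand-in, not the real mem_section layout, and `IS_POWER_OF_2` is a local macro defined only for this example.

```c
/* Non-zero when n is a power of two. */
#define IS_POWER_OF_2(n) ((n) != 0 && (((n) & ((n) - 1)) == 0))

/* Hypothetical stand-in for struct mem_section; the fields are made up. */
struct demo_section {
	unsigned long section_mem_map;
	unsigned long *pageblock_flags;
};

/*
 * C11 compile-time check: the build fails if someone adds a field that
 * makes the struct size a non-power-of-2.
 */
_Static_assert(IS_POWER_OF_2(sizeof(struct demo_section)),
	       "sizeof(struct demo_section) must be a power of 2");
```

Adding a third word-sized field to `demo_section` would turn the boot-time WARNs described above into an immediate build error that names the struct.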
    • J
      mm: kill free_all_bootmem_node() · e1280be0
      Jiang Liu committed
      Nobody uses free_all_bootmem_node() any more, so kill it.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1280be0
    • J
      mm: introduce helper function set_max_mapnr() · fccc9987
      Jiang Liu committed
      Introduce a helper function set_max_mapnr() to set global variable
      max_mapnr.
      
      Also unify conditional compilation for max_mapnr under
      CONFIG_NEED_MULTIPLE_NODES instead of CONFIG_DISCONTIGMEM.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fccc9987
    • J
      mm: kill global variable num_physpages · 18954181
      Jiang Liu committed
      Now all references to num_physpages have been removed, so kill it.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18954181
    • J
      mm: introduce helper function mem_init_print_info() to simplify mem_init() · 7ee3d4e8
      Jiang Liu committed
      Introduce helper function mem_init_print_info() to simplify mem_init()
      across different architectures, which also unifies the format and
      information printed.
      
      Function mem_init_print_info() calculates memory statistics information
      without walking each page, so it should be a little faster on some
      architectures.
      
      Also introduce another helper get_num_physpages() to kill the global
      variable num_physpages.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ee3d4e8
    • J
      vmlinux.lds: add comments for global variables and clean up useless declarations · 1622d1ab
      Jiang Liu committed
      The original goal of this patchset is to fix the bug reported at
      https://bugzilla.kernel.org/show_bug.cgi?id=53501.  It has since been
      expanded to reduce common code used by memory initialization.
      
      Patch 1-7:
      	1) add comments for global variables exported by vmlinux.lds
      	2) normalize global variables exported by vmlinux.lds
      Patch 8:
      	Introduce helper functions mem_init_print_info() and
      	get_num_physpages()
      Patch 9:
      	Avoid using global variable num_physpages at runtime
      Patch 10:
      	Don't update num_physpages in memory_hotplug.c
      Patch 11-40:
      	Modify arch mm initialization code to:
      	1) Simplify mem_init() by using mem_init_print_info()
      	2) Prepare for killing global variable num_physpages
      Patch 41:
      	Kill the global variable num_physpages
      
      With all patches applied, mem_init(), free_initmem(), free_initrd_mem()
      could be as simple as below.  This patch series has reduced about 1.2K
      lines of code in total.
      
      #ifndef CONFIG_DISCONTIGMEM
      void __init
      mem_init(void)
      {
      	max_mapnr = max_low_pfn;
      	free_all_bootmem();
      	high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
      
      	mem_init_print_info(NULL);
      }
      #endif /* CONFIG_DISCONTIGMEM */
      
      void
      free_initmem(void)
      {
      	free_initmem_default(-1);
      }
      
      #ifdef CONFIG_BLK_DEV_INITRD
      void
      free_initrd_mem(unsigned long start, unsigned long end)
      {
      	free_reserved_area(start, end, -1, "initrd");
      }
      #endif
      
      Due to hardware resource limitations, I have only tested this on x86_64.
      And the messages reported on an x86_64 system are:
      
      Log message before applying patches:
      Memory: 7745676k/8910848k available (6934k kernel code, 836024k absent, 329148k reserved, 6343k data, 1012k init)
      
      Log message after applying patches:
      Memory: 7744624K/8074824K available (6969K kernel code, 1011K data, 2828K rodata, 1016K init, 9640K bss, 330200K reserved)
      
      Great thanks to Vineet Gupta for testing on ARC.
      
      This patch:
      
      Document global variables exported from vmlinux.lds.
      
      1) Add comments about usage guidelines for global variables exported
         from vmlinux.lds.S.
      2) Remove unused __initdata_begin[] and __initdata_end[].
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1622d1ab
    • J
      mm: use a dedicated lock to protect totalram_pages and zone->managed_pages · c3d5f5f0
      Jiang Liu committed
      Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
      protect totalram_pages and zone->managed_pages.  Other than the memory
      hotplug driver, totalram_pages and zone->managed_pages may also be
      modified at runtime by other drivers, such as Xen balloon,
      virtio_balloon etc.  For those cases, memory hotplug lock is a little
      too heavy, so introduce a dedicated lock to protect totalram_pages and
      zone->managed_pages.
      
      Now we have simplified locking rules for totalram_pages and
      zone->managed_pages:
      
      1) no locking for read accesses because they are unsigned long.
      2) no locking for write accesses at boot time in single-threaded context.
      3) serialize write accesses at runtime by acquiring the dedicated
         managed_page_count_lock.
      
      Also adjust zone->managed_pages when freeing reserved pages into the
      buddy system, to keep totalram_pages and zone->managed_pages
      consistent.
      
      [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3d5f5f0
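The three locking rules in this commit can be sketched in user space with a pthread mutex standing in for the dedicated kernel lock. This is an analogy only: the signature of `adjust_managed_page_count` is simplified to a single delta (the kernel function's real parameters are not given in the text), `read_totalram_pages` is a hypothetical name, and a single variable stands in for one zone's managed_pages.

```c
#include <pthread.h>

/* User-space stand-ins for the kernel counters; names mirror the text. */
static unsigned long totalram_pages;
static unsigned long zone_managed_pages;

/* Stand-in for the dedicated managed_page_count_lock. */
static pthread_mutex_t managed_page_count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Rule 3: runtime writers serialize on the dedicated lock. */
void adjust_managed_page_count(long count)
{
	pthread_mutex_lock(&managed_page_count_lock);
	zone_managed_pages += count;
	totalram_pages += count;
	pthread_mutex_unlock(&managed_page_count_lock);
}

/*
 * Rule 1: readers take no lock. The commit relies on word-sized loads
 * of unsigned long; portable multi-threaded user-space code would use
 * C11 atomics for this instead.
 */
unsigned long read_totalram_pages(void)
{
	return totalram_pages;
}
```

Rule 2 needs no code: boot-time initialization runs single-threaded, so writes there can skip the lock. Keeping both counters under one lock is also what lets them stay consistent with each other.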
    • J
      mm: accurately calculate zone->managed_pages for highmem zones · 7b4b2a0d
      Jiang Liu committed
      Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
      that all highmem pages will be freed into the buddy system by function
      mem_init().  But that's not always true; some architectures may reserve
      some highmem pages during boot.  For example, PPC may allocate highmem
      pages for giant HugeTLB pages, and several architectures have code to
      check PageReserved flag to exclude highmem pages allocated during boot
      when freeing highmem pages into the buddy system.
      
      So treat highmem pages in the same way as normal pages, that is to:
      1) reset zone->managed_pages to zero in mem_init().
      2) recalculate managed_pages when freeing pages into the buddy system.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b4b2a0d
    • J
      mm: enhance free_reserved_area() to support poisoning memory with zero · dbe67df4
      Jiang Liu committed
      Address more review comments from the last round of code review.
      1) Enhance free_reserved_area() to support poisoning freed memory with
         pattern '0'. This could be used to get rid of poison_init_mem()
         on ARM64.
      2) A previous patch disabled memory poisoning for initmem on s390
         by mistake, so restore the original behavior.
      3) Remove redundant PAGE_ALIGN() when calling free_reserved_area().
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbe67df4
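The poisoning behavior in item 1 of this commit can be illustrated with a simplified user-space analogue. It is a sketch under stated assumptions: `demo_free_reserved_area` and `DEMO_PAGE_SIZE` are hypothetical, the range is given as two pointers with `end` not before `start`, and a poison value in 0..0xFF is written as a fill byte while any other value (such as -1) disables poisoning.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/*
 * Simplified analogue of free_reserved_area(): walk a range one page
 * at a time, optionally fill each page with a poison byte, and return
 * the number of whole pages processed.
 * Assumption: poison in 0..0xFF is used as the fill byte (so 0 is a
 * valid pattern); any other value, such as -1, means "do not poison".
 */
unsigned long demo_free_reserved_area(void *start, void *end, int poison)
{
	unsigned char *pos = start;
	unsigned char *stop = end;
	unsigned long pages = 0;

	while ((size_t)(stop - pos) >= DEMO_PAGE_SIZE) {
		if (poison >= 0 && poison <= 0xFF)
			memset(pos, poison, DEMO_PAGE_SIZE);
		/* A real implementation would hand the page to the allocator here. */
		pos += DEMO_PAGE_SIZE;
		pages++;
	}
	return pages;
}
```

Treating 0 as a legitimate pattern is the point of the change: a caller can no longer use 0 to mean "no poison", so a sentinel such as -1 is needed for that case.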
    • J
      mm: change signature of free_reserved_area() to fix building warnings · 11199692
      Jiang Liu committed
      Change the signature of free_reserved_area() according to Russell King's
      suggestion to fix the following build warnings:
      
        arch/arm/mm/init.c: In function 'mem_init':
        arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
          free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
          ^
        In file included from include/linux/mman.h:4:0,
                         from arch/arm/mm/init.c:15:
        include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
         extern unsigned long free_reserved_area(unsigned long start, unsigned long end,
      
         mm/page_alloc.c: In function 'free_reserved_area':
      >> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
         In file included from arch/mips/include/asm/page.h:49:0,
                          from include/linux/mmzone.h:20,
                          from include/linux/gfp.h:4,
                          from include/linux/mm.h:8,
                          from mm/page_alloc.c:18:
         arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
         mm/page_alloc.c: In function 'free_area_init_nodes':
         mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]
      
      Also address some minor code review comments.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11199692
    • R
      swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES · dcf6b7dd
      Rafael Aquini committed
      Considering the use cases where the swap device supports discard:
      a) and can do it quickly;
      b) but it's slow to do in small granularities (or concurrent with other
         I/O);
      c) but the implementation is so horrendous that you don't even want to
         send one down;
      
      And assuming that the sysadmin considers it useful to send the discards down
      at all, we would (probably) want the following solutions:
      
        i. do the fine-grained discards for freed swap pages, if device is
           capable of doing so optimally;
       ii. do single-time (batched) swap area discards, either at swapon
           or via something like fstrim (not implemented yet);
      iii. allow doing both single-time and fine-grained discards; or
       iv. turn it off completely (default behavior)
      
      As implemented today, one can only enable or disable discards for swap
      as a whole.  One cannot select, for instance, solution (ii) on a swap
      device like (b), even though the single-time discard is regarded as
      interesting or necessary to the workload, because enabling it would
      also imply (i), which the device is not capable of performing optimally.
      
      This patch addresses the scenario depicted above by introducing a way to
      ensure the (probably) wanted solutions (i, ii, iii and iv) can be flexibly
      flagged through swapon(8), allowing a sysadmin to select the most suitable
      swap discard policy according to system constraints.
      
      This patch introduces the new SWAP_FLAG_DISCARD_PAGES and
      SWAP_FLAG_DISCARD_ONCE flags to allow more flexible swap discard policies
      to be flagged through swapon(8).  The default behavior is to keep both
      single-time, or batched, area discards (SWAP_FLAG_DISCARD_ONCE) and
      fine-grained discards for page-clusters (SWAP_FLAG_DISCARD_PAGES) enabled,
      in order to keep consistency with older kernel behavior, as well as
      maintain compatibility with older swapon(8).  However, through the newly
      introduced flags the most suitable discard policy can be selected
      according to any given swap device constraint.
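
      As a rough illustration of the policies (i)-(iv) listed above, the
      sketch below maps each one to a swapon(2) flag combination.  The enum
      and the helper name are hypothetical and exist only for this example;
      the flag values are the ones from include/linux/swap.h as best I
      recall, and the mapping is one reading of this changelog rather than
      code taken from the patch.

```c
#include <assert.h>

/* Flag values believed to match include/linux/swap.h (assumption). */
#define SWAP_FLAG_DISCARD       0x10000 /* enable discard for swap */
#define SWAP_FLAG_DISCARD_ONCE  0x20000 /* discard the swap area at swapon time */
#define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */

/* Hypothetical policy names for solutions (i)-(iv) in the changelog. */
enum discard_policy {
	DISCARD_PAGES_ONLY, /* (i)   fine-grained discards only */
	DISCARD_ONCE_ONLY,  /* (ii)  single-time area discard only */
	DISCARD_BOTH,       /* (iii) both, the backward-compatible default */
	DISCARD_OFF         /* (iv)  no discards at all */
};

/* Hypothetical helper: translate a policy into swapon(2) flags. */
static int discard_policy_to_swap_flags(enum discard_policy policy)
{
	switch (policy) {
	case DISCARD_PAGES_ONLY:
		return SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_PAGES;
	case DISCARD_ONCE_ONLY:
		return SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE;
	case DISCARD_BOTH:
		/* SWAP_FLAG_DISCARD alone keeps the older "do both" behavior. */
		return SWAP_FLAG_DISCARD;
	case DISCARD_OFF:
		return 0;
	}
	assert(!"unknown discard policy");
	return 0;
}
```

      The returned value would be OR-ed into the flags argument of
      swapon(2); an older swapon(8) that only knows SWAP_FLAG_DISCARD ends
      up in the DISCARD_BOTH case.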
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcf6b7dd
    • T
      mm: tune vm_committed_as percpu_counter batching size · 917d9290
      Tim Chen committed
      Currently the per cpu counter's batch size for memory accounting is
      configured as twice the number of cpus in the system.  However, for
      systems with very large memory, it is more appropriate to make it
      proportional to the memory size per cpu in the system.
      
      For example, for an x86_64 system with 64 cpus and 128 GB of memory, the
      batch size is only 2*64 pages (0.5 MB).  So any memory accounting
      changes of more than 0.5 MB will overflow the per cpu counter into the
      global counter.  Instead, for the new scheme, the batch size is
      configured to be 0.4% of the memory/cpu = 8 MB (128 GB/64/256), which is
      more in line with the memory size.
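
      The arithmetic in this example can be checked with a small
      user-space sketch.  The function names are hypothetical, 4 KB pages
      are assumed, and the rule that the new batch never drops below the
      old one is an assumption of this sketch, not something stated in
      the changelog.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme described above: batch is twice the number of CPUs, in pages. */
static int32_t old_batch_pages(int32_t nr_cpus)
{
	assert(nr_cpus > 0);
	return 2 * nr_cpus;
}

/*
 * New scheme: 1/256 (about 0.4%) of the memory per CPU, in pages.
 * Assumptions of this sketch: the result is capped at INT32_MAX and is
 * never smaller than the old value.
 */
static int32_t new_batch_pages(uint64_t total_ram_pages, int32_t nr_cpus)
{
	assert(nr_cpus > 0);

	uint64_t memsized = (total_ram_pages / (uint64_t)nr_cpus) / 256;
	if (memsized > INT32_MAX)
		memsized = INT32_MAX;

	int32_t min_batch = old_batch_pages(nr_cpus);
	return memsized > (uint64_t)min_batch ? (int32_t)memsized : min_batch;
}
```

      With 64 cpus and 128 GB of 4 KB pages (33554432 pages), the old
      scheme gives 128 pages (0.5 MB) and the new one gives 2048 pages
      (8 MB), matching the figures quoted above.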
      
      I've done a repeated brk test of 800KB (from will-it-scale test suite)
      with 80 concurrent processes on a 4 socket Westmere machine with a total
      of 40 cores.  Without the patch, about 80% of cpu is spent on spin-lock
      contention within the vm_committed_as counter.  With the patch, there's
      a 73x speedup on the benchmark and the lock contention drops off almost
      entirely.
      
      [akpm@linux-foundation.org: fix section mismatch]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      917d9290
    • W
      mm/hugetlb: remove hugetlb_prefault · 5f1e31d2
      Wanpeng Li committed
      hugetlb_prefault() is no longer used; this patch removes it.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f1e31d2
    • W
      mm/pageblock: remove get/set_pageblock_flags · 4c42efa2
      Wanpeng Li committed
      get_pageblock_flags and set_pageblock_flags are no longer used; this
      patch removes them.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c42efa2
    • M
      mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru · c53954a0
      Mel Gorman committed
      Similar to __pagevec_lru_add, this patch removes the LRU parameter from
      __lru_cache_add and lru_cache_add_lru as the caller does not control the
      exact LRU the page gets added to.  lru_cache_add_lru gets renamed to
      lru_cache_add as the name is silly without the lru parameter.  With the
      parameter removed, it is required that the caller indicate if they want
      the page added to the active or inactive list by setting or clearing
      PageActive respectively.
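
      The calling convention described above can be pictured with a toy
      user-space model.  All toy_* names are hypothetical stand-ins, and
      the model skips the per-cpu pagevec entirely, so it only shows how
      the target list follows from the page's active flag instead of an
      explicit lru argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's LRU lists and struct page. */
enum toy_lru { TOY_LRU_INACTIVE, TOY_LRU_ACTIVE };

struct toy_page {
	bool active;         /* models the PageActive flag */
	enum toy_lru on_lru; /* list the page ended up on */
};

/*
 * Models lru_cache_add() after the patch: there is no lru argument, so
 * the list is derived from the flag the caller set or cleared beforehand.
 */
static void toy_lru_cache_add(struct toy_page *page)
{
	assert(page != NULL);
	page->on_lru = page->active ? TOY_LRU_ACTIVE : TOY_LRU_INACTIVE;
}
```

      A caller that wants the active list sets the flag before the call;
      one that wants the inactive list clears it.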
      
      [akpm@linux-foundation.org: Suggested the patch]
      [gang.chen@asianux.com: fix used-uninitialized warning]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c53954a0
    • M
      mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API · a0b8cab3
      Mel Gorman committed
      Now that the LRU to add a page to is decided at LRU-add time, remove the
      misleading lru parameter from __pagevec_lru_add.  A consequence of this
      is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar
      helpers are misleading as the caller no longer has direct control over
      what LRU the page is added to.  Unused helpers are removed by this patch
      and existing users of pagevec_lru_add_file() are converted to use
      lru_cache_add_file() directly and use the per-cpu pagevecs instead of
      creating their own pagevec.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0b8cab3
    • M
      mm: add tracepoints for LRU activation and insertions · c6286c98
      Mel Gorman committed
      Andrew Perepechko reported a problem whereby pages are being prematurely
      evicted as the mark_page_accessed() hint is ignored for pages that are
      currently on a pagevec --
      http://www.spinics.net/lists/linux-ext4/msg37340.html .
      
      Alexey Lyahkov and Robin Dong have also reported problems recently that
      could be due to hot pages reaching the end of the inactive list too
      quickly and being reclaimed.
      
      Rather than addressing this on a per-filesystem basis, this series aims
      to fix the mark_page_accessed() interface by deferring the decision on
      what LRU a page is added to until pagevec drain time and allowing
      mark_page_accessed() to call SetPageActive on a pagevec page.
      
      Patch 1 adds two tracepoints for LRU page activation and insertion. Using
      	these tracepoints it's possible to build a model of pages in the
      	LRU that can be processed offline.
      
      Patch 2 defers making the decision on what LRU to add a page to until when
      	the pagevec is drained.
      
      Patch 3 searches the local pagevec for pages to mark PageActive on
      	mark_page_accessed. The changelog explains why only the local
      	pagevec is examined.
      
      Patches 4 and 5 tidy up the API.
      
      postmark, a dd-based test and fs-mark in both single and threaded mode
      were run but none of them showed any performance degradation or gain as
      a result of the patch.
      
      Using patch 1, I built a *very* basic model of the LRU to examine
      offline what the average age of different page types on the LRU was in
      milliseconds.  Of course, capturing the trace distorts the test as it's
      written to local disk but it does not matter for the purposes of this
      test.  The average ages of pages in milliseconds were
      
      				    vanilla deferdrain
      Average age mapped anon:               1454       1250
      Average age mapped file:             127841     155552
      Average age unmapped anon:               85        235
      Average age unmapped file:            73633      38884
      Average age unmapped buffers:         74054     116155
      
      The LRU activity was mostly files which you'd expect for a dd-based
      workload.  Note that the average age of buffer pages is increased by the
      series and it is expected this is due to the fact that the buffer pages
      are now getting added to the active list when drained from the pagevecs.
      Note that the average age of the unmapped file data is decreased as they
      are still added to the inactive list and are reclaimed before the
      buffers.
      
      There is no guarantee this is a universal win for all workloads and it
      would be nice if the filesystem people gave some thought as to whether
      this decision is generally a win or a loss.
      
      This patch:
      
      Using these tracepoints it is possible to model LRU activity and the
      average residency of pages of different types.  This can be used to
      debug problems related to premature reclaim of pages of particular
      types.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6286c98
    • H
      vmalloc: introduce remap_vmalloc_range_partial · e69e9d4a
      HATAYAMA Daisuke committed
      We want to allocate the ELF note segment buffer on the 2nd kernel in
      vmalloc space and remap it to user-space in order to reduce the risk
      that memory allocation fails on systems with a huge number of CPUs, and
      so with a huge ELF note segment that exceeds the 11-order block size.
      
      Although there's already remap_vmalloc_range for the purpose of
      remapping vmalloc memory to user-space, we need to specify the
      user-space range via vma.  Mmap on /proc/vmcore needs to remap a range
      across multiple objects, so an interface that requires the vma to cover
      the full range is problematic.
      
      This patch introduces remap_vmalloc_range_partial, which receives the
      user-space range as a pair of base address and size and can be used for
      the mmap on /proc/vmcore case.
      
      remap_vmalloc_range is rewritten using remap_vmalloc_range_partial.
      
      [akpm@linux-foundation.org: use PAGE_ALIGNED()]
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e69e9d4a
    • A
      include/linux/mm.h: add PAGE_ALIGNED() helper · 0fa73b86
      Andrew Morton committed
      To test whether an address is aligned to PAGE_SIZE.
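
      A user-space sketch of the kind of check such a helper performs is
      shown below.  The toy_* names are hypothetical and a 4 KiB page size
      is assumed; the in-kernel macro uses the architecture's PAGE_SIZE
      and may be written differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Returns true when addr is a multiple of TOY_PAGE_SIZE. */
static bool toy_page_aligned(uintptr_t addr)
{
	/* The mask trick below is only valid for power-of-two page sizes. */
	assert((TOY_PAGE_SIZE & (TOY_PAGE_SIZE - 1)) == 0);
	return (addr & (TOY_PAGE_SIZE - 1)) == 0;
}
```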
      
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fa73b86
    • C
      114d4b79
    • C
    • M
      mm: vmscan: take page buffers dirty and locked state into account · b4597226
      Mel Gorman committed
      Page reclaim keeps track of dirty and under writeback pages and uses it
      to determine if wait_iff_congested() should stall or if kswapd should
      begin writing back pages.  This fails to account for buffer pages that
      can be under writeback but not PageWriteback which is the case for
      filesystems like ext3 ordered mode.  Furthermore, PageDirty buffer pages
      can have all the buffers clean and writepage does no IO so it should not
      be accounted as congested.
      
      This patch adds an address_space operation that filesystems may
      optionally use to check if a page is really dirty or really under
      writeback.  An implementation for buffer_heads is added and used for
      block operations and ext3 in ordered mode.  By default the page flags
      are obeyed.
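
      The idea of consulting the buffers rather than only the page flags
      can be sketched with a toy model.  The structures and the function
      below are hypothetical and are not the patch's buffer_heads
      implementation; the sketch assumes a page counts as dirty only if
      some buffer is dirty, and as under writeback if the page flag is
      set or some buffer is locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a buffer_head's dirty and locked bits. */
struct toy_buffer {
	bool dirty;
	bool locked;
};

/* Toy stand-in for a page that carries buffers. */
struct toy_buffer_page {
	bool page_writeback;              /* models PageWriteback */
	const struct toy_buffer *buffers; /* buffers attached to the page */
	size_t nr_buffers;
};

/* Report the "real" dirty and writeback state by scanning the buffers. */
static void toy_check_dirty_writeback(const struct toy_buffer_page *page,
				      bool *dirty, bool *writeback)
{
	assert(page && dirty && writeback);

	*dirty = false;
	*writeback = page->page_writeback;

	for (size_t i = 0; i < page->nr_buffers; i++) {
		if (page->buffers[i].locked)
			*writeback = true; /* IO in flight on this buffer */
		if (page->buffers[i].dirty)
			*dirty = true;
	}
}
```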
      
      Credit goes to Jan Kara for identifying that the page flags alone are
      not sufficient for ext3 and sanity checking a number of ideas on how the
      problem could be addressed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4597226
    • M
      mm: vmscan: block kswapd if it is encountering pages under writeback · 283aba9f
      Mel Gorman committed
      Historically, kswapd used to congestion_wait() at higher priorities if
      it was not making forward progress.  This made no sense as the failure
      to make progress could be completely independent of IO.  It was later
      replaced by wait_iff_congested() and removed entirely by commit 258401a6
      (mm: don't wait on congested zones in balance_pgdat()) as it was
      duplicating logic in shrink_inactive_list().
      
      This is problematic.  If kswapd encounters many pages under writeback
      and it continues to scan until it reaches the high watermark then it
      will quickly skip over the pages under writeback and reclaim clean young
      pages or push applications out to swap.
      
      The use of wait_iff_congested() is not suited to kswapd as it will only
      stall if the underlying BDI is really congested or a direct reclaimer
      was unable to write to the underlying BDI.  kswapd bypasses the BDI
      congestion as it sets PF_SWAPWRITE but even if this was taken into
      account then it would cause direct reclaimers to stall on writeback
      which is not desirable.
      
      This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
      encountering too many pages under writeback.  If this flag is set and
      kswapd encounters a PageReclaim page under writeback then it'll assume
      that the LRU lists are being recycled too quickly before IO can complete
      and block waiting for some IO to complete.
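
      The condition in the last paragraph can be written down as a toy
      predicate.  The structures and names are hypothetical stand-ins for
      the kernel's reclaim state, and the sketch only encodes the decision
      described here, not the waiting itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the state of the reclaimer and the zone. */
struct toy_reclaim_ctx {
	bool is_kswapd;      /* caller is kswapd, not a direct reclaimer */
	bool zone_writeback; /* models the ZONE_WRITEBACK flag */
};

/* Toy stand-in for the relevant page flags. */
struct toy_page_state {
	bool writeback; /* models PageWriteback */
	bool reclaim;   /* models PageReclaim */
};

/* True when, per the description above, kswapd should block waiting for IO. */
static bool toy_kswapd_should_wait(const struct toy_reclaim_ctx *ctx,
				   const struct toy_page_state *page)
{
	assert(ctx && page);
	return ctx->is_kswapd && ctx->zone_writeback &&
	       page->writeback && page->reclaim;
}
```

      A direct reclaimer never satisfies the predicate, which matches the
      stated goal of not stalling direct reclaimers on writeback.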
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      283aba9f