提交 · 7e47682ea555e7c1edef1d8fd96e2aa4c12abe59 · openeuler / raspberrypi-kernel

15 7月, 2015 1 次提交

cgroup: allow a cgroup subsystem to reject a fork · 7e47682e

由 Aleksa Sarai 提交于 6月 09, 2015

Add a new cgroup subsystem callback can_fork that conditionally
states whether or not the fork is accepted or rejected by a cgroup
policy. In addition, add a cancel_fork callback so that if an error
occurs later in the forking process, any state modified by can_fork can
be reverted.

Allow for a private opaque pointer to be passed from cgroup_can_fork to
cgroup_post_fork, allowing for the fork state to be stored by each
subsystem separately.

Also add a tagging system for cgroup_subsys.h to allow for CGROUP_<TAG>
enumerations to be be defined and used. In addition, explicitly add a
CGROUP_CANFORK_COUNT macro to make arrays easier to define.

This is in preparation for implementing the pids cgroup subsystem.
Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

7e47682e

05 7月, 2015 2 次提交

NTB: Split ntb_hw_intel and ntb_transport drivers · e26a5843

由 Allen Hubbe 提交于 4月 09, 2015

Change ntb_hw_intel to use the new NTB hardware abstraction layer.

Split ntb_transport into its own driver.  Change it to use the new NTB
hardware abstraction layer.
Signed-off-by: NAllen Hubbe <Allen.Hubbe@emc.com>
Signed-off-by: NJon Mason <jdmason@kudzu.us>

e26a5843

NTB: Add NTB hardware abstraction layer · a1bd3bae

由 Allen Hubbe 提交于 4月 09, 2015

Abstract the NTB device behind a programming interface, so that it can
support different hardware and client drivers.
Signed-off-by: NAllen Hubbe <Allen.Hubbe@emc.com>
Signed-off-by: NJon Mason <jdmason@kudzu.us>

a1bd3bae

04 7月, 2015 3 次提交

sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h · 6b55c965

由 Srikar Dronamraju 提交于 6月 25, 2015

Currently print_cfs_rq() is declared in include/linux/sched.h.
However it's not used outside kernel/sched. Hence move the
declaration to kernel/sched/sched.h

Also some functions are only available for CONFIG_SCHED_DEBUG=y.
Hence move the declarations to within the #ifdef.
Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NRik van Riel <riel@redhat.com>
Cc: Iulia Manda <iulia.manda21@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435252903-1081-2-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

6b55c965

sched/stat: Simplify the sched_info accounting dependency · f6db8347

由 Naveen N. Rao 提交于 6月 25, 2015

Both CONFIG_SCHEDSTATS=y and CONFIG_TASK_DELAY_ACCT=y track task
sched_info, which results in ugly #if clauses.

Simplify the code by introducing a synthethic CONFIG_SCHED_INFO
switch, selected by both.
Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: a.p.zijlstra@chello.nl
Cc: ricklind@us.ibm.com
Link: http://lkml.kernel.org/r/8d19eef800811a94b0f91bcbeb27430a884d7433.1435255405.git.naveen.n.rao@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

f6db8347

sched, preempt_notifier: separate notifier registration from static_key inc/dec · 2ecd9d29

由 Peter Zijlstra 提交于 7月 03, 2015

Commit 1cde2930 ("sched/preempt: Add static_key() to preempt_notifiers")
had two problems.  First, the preempt-notifier API needs to sleep with the
addition of the static_key, we do however need to hold off preemption
while modifying the preempt notifier list, otherwise a preemption could
observe an inconsistent list state.  KVM correctly registers and
unregisters preempt notifiers with preemption disabled, so the sleep
caused dmesg splats.

Second, KVM registers and unregisters preemption notifiers very often
(in vcpu_load/vcpu_put).  With a single uniprocessor guest the static key
would move between 0 and 1 continuously, hitting the slow path on every
userspace exit.

To fix this, wrap the static_key inc/dec in a new API, and call it from
KVM.

Fixes: 1cde2930 ("sched/preempt: Add static_key() to preempt_notifiers")
Reported-by: NPontus Fuchs <pontus.fuchs@gmail.com>
Reported-by: NTakashi Iwai <tiwai@suse.de>
Tested-by: NTakashi Iwai <tiwai@suse.de>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2ecd9d29

03 7月, 2015 1 次提交

irqchip: Move IRQCHIP_DECLARE macro to include/linux/irqchip.h · 91e20b50

由 Joel Porquet 提交于 7月 02, 2015

At the moment the IRQCHIP_DECLARE macro is only declared locally in
drivers/irqchip/irqchip.h. It prevents from using it directly in arch/*
directories whenever irqchip drivers only exist there, which happens in a few
cases (e.g. arc, arm, microblaze and mips).

This patch makes the macro to be globally defined, i.e. in
include/linux/irqchip.h, and thus usable for arch-specific declarations of
irqchip drivers. In this way, it is very similar to what clocksource does (ie
CLOCKSOURCE_OF_DECLARE is defined in include/linux/clocksource.h).

For now, this patch only moves the declaration of the macro
IRQCHIP_DECLARE to the global header 'include/linux/irqchip.h' and make
'drivers/irqchip/irqchip.h' include 'include/linux/irqchip.h'. Later, other
patches will get rid of 'drivers/irqchip/irqchip.h' and modify all the impacted
irqchip drivers.
Signed-off-by: NJoel Porquet <joel@porquet.org>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/1435865565-14114-1-git-send-email-joel@porquet.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

91e20b50

02 7月, 2015 2 次提交

writeback: don't embed root bdi_writeback_congested in bdi_writeback · a13f35e8

由 Tejun Heo 提交于 7月 02, 2015

52ebea74 ("writeback: make backing_dev_info host cgroup-specific
bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
(bdi_writeback's).  As the congested state needs to be per-wb and
referenced from blkcg side and multiple wbs, the patch made all
non-root cong's (bdi_writeback_congested's) reference counted and
indexed on bdi.

When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
non-root cong's; however, this can hang indefinitely because wb's can
also be referenced from blkcg_gq's which are destroyed after bdi
destruction is complete.

To fix the bug, bdi destruction will be updated to not wait for cong's
to drain, which naturally means that cong's may outlive the associated
bdi.  This is fine for non-root cong's but is problematic for the root
cong's which are embedded in their bdi's as they may end up getting
dereferenced after the containing bdi's are freed.

This patch makes root cong's behave the same as non-root cong's.  They
are no longer embedded in their bdi's but allocated separately during
bdi initialization, indexed and reference counted the same way.

* As cong handling is the same for all wb's, wb->congested
  initialization is moved into wb_init().

* When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
  bdi->wb_congested is now a pointer pointing to the root cong
  allocated during bdi init and minimal refcnting operations are
  implemented.

* The above makes root wb init paths diverge depending on
  CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().

This patch in itself shouldn't cause any consequential behavior
differences but prepares for the actual fix.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NJon Christopherson <jon@jons.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681Tested-by: NJon Christopherson <jon@jons.org>

Added <linux/slab.h> include to backing-dev.h for kfree() definition.
Signed-off-by: NJens Axboe <axboe@fb.com>

a13f35e8

NTB: Move files in preparation for NTB abstraction · ec110bc7

由 Allen Hubbe 提交于 5月 07, 2015

This patch only moves files to their new locations, before applying the
next two patches adding the NTB Abstraction layer.  Splitting this patch
from the next is intended make distinct which code is changed only due
to moving the files, versus which are substantial code changes in adding
the NTB Abstraction layer.
Signed-off-by: NAllen Hubbe <Allen.Hubbe@emc.com>
Signed-off-by: NJon Mason <jdmason@kudzu.us>

ec110bc7

01 7月, 2015 16 次提交

sysfs: Add support for permanently empty directories to serve as mount points. · 87d2846f

由 Eric W. Biederman 提交于 5月 13, 2015

Add two functions sysfs_create_mount_point and
sysfs_remove_mount_point that hang a permanently empty directory off
of a kobject or remove a permanently emptpy directory hanging from a
kobject.  Export these new functions so modular filesystems can use
them.

Cc: stable@vger.kernel.org
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

87d2846f

kernfs: Add support for always empty directories. · ea015218

由 Eric W. Biederman 提交于 5月 13, 2015

Add a new function kernfs_create_empty_dir that can be used to create
directory that can not be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

ea015218

sysctl: Allow creating permanently empty directories that serve as mountpoints. · f9bd6733

由 Eric W. Biederman 提交于 5月 09, 2015

Add a magic sysctl table sysctl_mount_point that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

Cc: stable@vger.kernel.org
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

f9bd6733

fs: Add helper functions for permanently empty directories. · fbabfd0f

由 Eric W. Biederman 提交于 5月 09, 2015

To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories.  Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permantently empty
directories that proc and sysfs can use when they create mount points
for filesystems and fs_fully_visible can use to test for permanently
empty directories to ensure that nothing will be gained by mounting a
fresh copy of proc or sysfs.

Cc: stable@vger.kernel.org
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

fbabfd0f

fs/file.c: don't acquire files->file_lock in fd_install() · 8a81252b

由 Eric Dumazet 提交于 6月 30, 2015

Mateusz Guzik reported :

 Currently obtaining a new file descriptor results in locking fdtable
 twice - once in order to reserve a slot and second time to fill it.

Holding the spinlock in __fd_install() is needed in case a resize is
done, or to prevent a resize.

Mateusz provided an RFC patch and a micro benchmark :
  http://people.redhat.com/~mguzik/pipebench.c

A resize is an unlikely operation in a process lifetime,
as table size is at least doubled at every resize.

We can use RCU instead of the spinlock.

__fd_install() must wait if a resize is in progress.

The resize must block new __fd_install() callers from starting,
and wait that ongoing install are finished (synchronize_sched())

resize should be attempted by a single thread to not waste resources.

rcu_sched variant is used, as __fd_install() and expand_fdtable() run
from process context.

It gives us a ~30% speedup using pipebench on a dual Intel(R) Xeon(R)
CPU E5-2696 v2 @ 2.50GHz
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NMateusz Guzik <mguzik@redhat.com>
Acked-by: NMateusz Guzik <mguzik@redhat.com>
Tested-by: NMateusz Guzik <mguzik@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8a81252b

genalloc: rename of_get_named_gen_pool() to of_gen_pool_get() · abdd4a70

由 Vladimir Zapolskiy 提交于 6月 30, 2015

To be consistent with other kernel interface namings, rename
of_get_named_gen_pool() to of_gen_pool_get().  In the original function
name "_named" suffix references to a device tree property, which contains
a phandle to a device and the corresponding device driver is assumed to
register a gen_pool object.

Due to a weak relation and to avoid any confusion (e.g.  in future
possible scenario if gen_pool objects are named) the suffix is removed.

[sfr@canb.auug.org.au: crypto/marvell/cesa - fix up for of_get_named_gen_pool() rename]
Signed-off-by: NVladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Philipp Zabel <p.zabel@pengutronix.de>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: Sascha Hauer <kernel@pengutronix.de>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jaroslav Kysela <perex@perex.cz>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Boris BREZILLON <boris.brezillon@free-electrons.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

abdd4a70

genalloc: rename dev_get_gen_pool() to gen_pool_get() · 0030edf2

由 Vladimir Zapolskiy 提交于 6月 30, 2015

To be consistent with other genalloc interface namings, rename
dev_get_gen_pool() to gen_pool_get().  The original omitted "dev_" prefix
is removed, since it points to argument type of the function, and so it
does not bring any useful information.

[akpm@linux-foundation.org: update arch/arm/mach-socfpga/pm.c]
Signed-off-by: NVladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Acked-by: NNicolas Ferre <nicolas.ferre@atmel.com>
Cc: Philipp Zabel <p.zabel@pengutronix.de>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: Sascha Hauer <kernel@pengutronix.de>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Mark Brown <broonie@kernel.org>
Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Alan Tull <atull@opensource.altera.com>
Cc: Dinh Nguyen <dinguyen@opensource.altera.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0030edf2

drivers/scsi/scsi_debug.c: resolve sg buffer const-ness issue · 386ecb12

由 Dave Gordon 提交于 6月 30, 2015

do_device_access() takes a separate parameter to indicate the direction of
data transfer, which it used to use to select the appropriate function out
of sg_pcopy_{to,from}_buffer().  However these two functions now have

So this patch makes it bypass these wrappers and call the underlying
function sg_copy_buffer() directly; this has the same calling style as
do_device_access() i.e.  a separate direction-of-transfer parameter and no
pointers-to-const, so skipping the wrappers not only eliminates the
warning, it also make the code simpler :)

[akpm@linux-foundation.org: fix very broken build]
Signed-off-by: NDave Gordon <david.s.gordon@intel.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

386ecb12

lib/scatterlist: mark input buffer parameters as 'const' · 2a1bf8f9

由 Dave Gordon 提交于 6月 30, 2015

The 'buf' parameter of sg(p)copy_from_buffer() can and should be
const-qualified, although because of the shared implementation of
_to_buffer() and _from_buffer(), we have to cast this away internally.

This means that callers who have a 'const' buffer containing the data to
be copied to the sg-list no longer have to cast away the const-ness
themselves.  It also enables improved coverage by code analysis tools.
Signed-off-by: NDave Gordon <david.s.gordon@intel.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2a1bf8f9

kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path · 5375b708

由 HATAYAMA Daisuke 提交于 6月 30, 2015

Commit f06e5153 ("kernel/panic.c: add "crash_kexec_post_notifiers"
option for kdump after panic_notifers") introduced
"crash_kexec_post_notifiers" kernel boot option, which toggles wheather
panic() calls crash_kexec() before panic_notifiers and dump kmsg or after.

The problem is that the commit overlooks panic_on_oops kernel boot option.
 If it is enabled, crash_kexec() is called directly without going through
panic() in oops path.

To fix this issue, this patch adds a check to "crash_kexec_post_notifiers"
in the condition of kexec_should_crash().

Also, put a comment in kexec_should_crash() to explain not obvious things
on this patch.
Signed-off-by: NHATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: NBaoquan He <bhe@redhat.com>
Tested-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5375b708

mm: meminit: finish initialisation of struct pages before basic setup · 0e1cc95b

由 Mel Gorman 提交于 6月 30, 2015

Waiman Long reported that 24TB machines hit OOM during basic setup when
struct page initialisation was deferred.  One approach is to initialise
memory on demand but it interferes with page allocator paths.  This patch
creates dedicated threads to initialise memory before basic setup.  It
then blocks on a rw_semaphore until completion as a wait_queue and counter
is overkill.  This may be slower to boot but it's simplier overall and
also gets rid of a section mangling which existed so kswapd could do the
initialisation.

[akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: Waiman Long <waiman.long@hp.com
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Scott Norton <scott.norton@hp.com>
Tested-by: NDaniel J Blueman <daniel@numascale.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0e1cc95b

mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set · 3a80a7fa

由 Mel Gorman 提交于 6月 30, 2015

This patch initalises all low memory struct pages and 2G of the highest
zone on each node during memory initialisation if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set.  That config option cannot be set
but will be available in a later patch.  Parallel initialisation of struct
page depends on some features from memory hotplug and it is necessary to
alter alter section annotations.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Tested-by: NNate Zimmer <nzimmer@sgi.com>
Tested-by: NWaiman Long <waiman.long@hp.com>
Tested-by: NDaniel J Blueman <daniel@numascale.com>
Acked-by: NPekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3a80a7fa

mm: meminit: inline some helper functions · 75a592a4

由 Mel Gorman 提交于 6月 30, 2015

early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation.  As well as
unnecessary visibility, it's unnecessary function call overhead when
initialising pages.  This patch moves the helpers inline.

[akpm@linux-foundation.org: fix build]
[mhocko@suse.cz: fix build]
Signed-off-by: NMel Gorman <mgorman@suse.de>
Tested-by: NNate Zimmer <nzimmer@sgi.com>
Tested-by: NWaiman Long <waiman.long@hp.com>
Tested-by: NDaniel J Blueman <daniel@numascale.com>
Acked-by: NPekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

75a592a4

mm: meminit: make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid · 8a942fde

由 Mel Gorman 提交于 6月 30, 2015

__early_pfn_to_nid() use static variables to cache recent lookups as
memblock lookups are very expensive but it assumes that memory
initialisation is single-threaded.  Parallel initialisation of struct
pages will break that assumption so this patch makes __early_pfn_to_nid()
SMP-safe by requiring the caller to cache recent search information.
early_pfn_to_nid() keeps the same interface but is only safe to use early
in boot due to the use of a global static variable.  meminit_pfn_in_nid()
is an SMP-safe version that callers must maintain their own state for.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Tested-by: NNate Zimmer <nzimmer@sgi.com>
Tested-by: NWaiman Long <waiman.long@hp.com>
Tested-by: NDaniel J Blueman <daniel@numascale.com>
Acked-by: NPekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8a942fde

mm: meminit: only set page reserved in the memblock region · 92923ca3

由 Nathan Zimmer 提交于 6月 30, 2015

Currently each page struct is set as reserved upon initialization.  This
patch leaves the reserved bit clear and only sets the reserved bit when it
is known the memory was allocated by the bootmem allocator.  This makes it
easier to distinguish between uninitialised struct pages and reserved
struct pages in later patches.
Signed-off-by: NRobin Holt <holt@sgi.com>
Signed-off-by: NNathan Zimmer <nzimmer@sgi.com>
Signed-off-by: NMel Gorman <mgorman@suse.de>
Tested-by: NNate Zimmer <nzimmer@sgi.com>
Tested-by: NWaiman Long <waiman.long@hp.com>
Tested-by: NDaniel J Blueman <daniel@numascale.com>
Acked-by: NPekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

92923ca3

memblock: introduce a for_each_reserved_mem_region iterator · 8e7a7f86

由 Robin Holt 提交于 6月 30, 2015

Struct page initialisation had been identified as one of the reasons why
large machines take a long time to boot. Patches were posted a long time ago
to defer initialisation until they were first used.  This was rejected on
the grounds it should not be necessary to hurt the fast paths. This series
reuses much of the work from that time but defers the initialisation of
memory to kswapd so that one thread per node initialises memory local to
that node.

After applying the series and setting the appropriate Kconfig variable I
see this in the boot log on a 64G machine

[    7.383764] kswapd 0 initialised deferred memory in 188ms
[    7.404253] kswapd 1 initialised deferred memory in 208ms
[    7.411044] kswapd 3 initialised deferred memory in 216ms
[    7.411551] kswapd 2 initialised deferred memory in 216ms

On a 1TB machine, I see

[    8.406511] kswapd 3 initialised deferred memory in 1116ms
[    8.428518] kswapd 1 initialised deferred memory in 1140ms
[    8.435977] kswapd 0 initialised deferred memory in 1148ms
[    8.437416] kswapd 2 initialised deferred memory in 1148ms

Once booted the machine appears to work as normal. Boot times were measured
from the time shutdown was called until ssh was available again.  In the
64G case, the boot time savings are negligible. On the 1TB machine, the
savings were 16 seconds.

Nate Zimmer said:

: On an older 8 TB box with lots and lots of cpus the boot time, as
: measure from grub to login prompt, the boot time improved from 1484
: seconds to exactly 1000 seconds.

Waiman Long said:

: I ran a bootup timing test on a 12-TB 16-socket IvyBridge-EX system.  From
: grub menu to ssh login, the bootup time was 453s before the patch and 265s
: after the patch - a saving of 188s (42%).

Daniel Blueman said:

: On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're seeing
: stock 4.0 boot in 7136s.  This drops to 2159s, or a 70% reduction with
: this patchset.  Non-temporal PMD init (https://lkml.org/lkml/2015/4/23/350)
: drops this to 1045s.

This patch (of 13):

As part of initializing struct page's in 2MiB chunks, we noticed that at
the end of free_all_bootmem(), there was nothing which had forced the
reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.
Signed-off-by: NRobin Holt <holt@sgi.com>
Signed-off-by: NNate Zimmer <nzimmer@sgi.com>
Signed-off-by: NMel Gorman <mgorman@suse.de>
Tested-by: NNate Zimmer <nzimmer@sgi.com>
Tested-by: NWaiman Long <waiman.long@hp.com>
Tested-by: NDaniel J Blueman <daniel@numascale.com>
Acked-by: NPekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8e7a7f86

30 6月, 2015 1 次提交

Fix kmalloc slab creation sequence · a9730fca

由 Christoph Lameter 提交于 6月 29, 2015

This patch restores the slab creation sequence that was broken by commit
4066c33d and also reverts the portions that introduced the
KMALLOC_LOOP_XXX macros. Those can never really work since the slab creation
is much more complex than just going from a minimum to a maximum number.

The latest upstream kernel boots cleanly on my machine with a 64 bit x86
configuration under KVM using either SLAB or SLUB.

Fixes: 4066c33d ("support the slub_debug boot option")
Reported-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a9730fca

29 6月, 2015 1 次提交

watchdog: watchdog_core: Add watchdog registration deferral mechanism · ef90174f

由 Jean-Baptiste Theou 提交于 6月 09, 2015

Currently, watchdog subsystem require the misc subsystem to
register a watchdog. This may not be the case in case of an
early registration of a watchdog, which can be required when
the watchdog cannot be disabled.

This patch introduces a deferral mechanism to remove this requirement.
Signed-off-by: NJean-Baptiste Theou <jtheou@adeneo-embedded.us>
Reviewed-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NWim Van Sebroeck <wim@iguana.be>

ef90174f

28 6月, 2015 1 次提交

param: fix module param locks when !CONFIG_SYSFS. · cf2fde7b

由 Rusty Russell 提交于 6月 26, 2015

As Dan Streetman points out, the entire point of locking for is to
stop sysfs accesses, so they're elided entirely in the !SYSFS case.
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>

cf2fde7b

27 6月, 2015 3 次提交

NFSv4.2: LAYOUTSTATS is optional to implement · 6c5a0d89

由 Trond Myklebust 提交于 6月 27, 2015

Make it so, by checking the return value for NFS4ERR_MOTSUPP and
caching the information as a server capability.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

6c5a0d89

genirq: Implement irq_set_handler_locked()/irq_set_chip_handler_name_locked() · bbc9d21f

由 Thomas Gleixner 提交于 6月 23, 2015

The main use case for the exisiting __irq_set_*_locked() inlines is to
replace the handler [,chip and name] of an interrupt from a region
which has the irq descriptor lock held, e.g. from the irq_set_type()
callback. The first argument is the irq number, so the functions need
so perform a pointless lookup of the interrupt descriptor for those
cases which have the irq_data pointer handy.

Provide new functions which take an irq_data pointer instead of the
interrupt number, so the lookup of the interrupt descriptor can be
avoided.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>

Conflicts:
	include/linux/irqdesc.h

bbc9d21f

genirq: Introduce helper irq_desc_get_irq() · 304adf8a

由 Jiang Liu 提交于 6月 04, 2015

Introduce helper irq_desc_get_irq() to retrieve the irq number from
the irq descriptor.
Signed-off-by: NJiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1433391238-19471-17-git-send-email-jiang.liu@linux.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

304adf8a

26 6月, 2015 9 次提交

arch, x86: pmem api for ensuring durability of persistent memory updates · 61031952

由 Ross Zwisler 提交于 6月 25, 2015

Based on an original patch by Ross Zwisler [1].

Writes to persistent memory have the potential to be posted to cpu
cache, cpu write buffers, and platform write buffers (memory controller)
before being committed to persistent media.  Provide apis,
memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
pmem and assert that it is durable in PMEM (a persistent linear address
range).  A '__pmem' attribute is added so sparse can track proper usage
of pointers to pmem.

This continues the status quo of pmem being x86 only for 4.2, but
reworks to ioremap, and wider implementation of memremap() will enable
other archs in 4.3.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
[djbw: various reworks]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

61031952

libnvdimm: Add sysfs numa_node to NVDIMM devices · 74ae66c3

由 Toshi Kani 提交于 6月 19, 2015

Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.

An example of numa_node values on a 2-socket system with a single
NVDIMM range on each socket is shown below.
  /sys/bus/nd/devices
  |-- btt0.0/numa_node:0
  |-- btt1.0/numa_node:1
  |-- btt1.1/numa_node:1
  |-- namespace0.0/numa_node:0
  |-- namespace1.0/numa_node:1
  |-- region0/numa_node:0
  |-- region1/numa_node:1

These numa_node files are then linked under the block class of
their device names.
  /sys/class/block/pmem0/device/numa_node:0
  /sys/class/block/pmem1s/device/numa_node:1

This enables numactl(8) to accept 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
  numactl --preferred block:pmem0 --show
  numactl --preferred file:/dev/pmem1s --show
Signed-off-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

74ae66c3

libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6

由 Toshi Kani 提交于 6月 19, 2015

ACPI NFIT table has System Physical Address Range Structure entries that
describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
set in the flags.

Change acpi_nfit_register_region() to map a proximity ID to its node ID,
and set it to a new numa_node field of nd_region_desc, which is then
conveyed to the nd_region device.

The device core arranges for btt and namespace devices to inherit their
node from their parent region.
Signed-off-by: NToshi Kani <toshi.kani@hp.com>
[djbw: move set_dev_node() from region.c to bus.c]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

41d7a6d6

acpi: Add acpi_map_pxm_to_online_node() · 99759869

由 Toshi Kani 提交于 6月 19, 2015

The kernel initializes CPU & memory's NUMA topology from ACPI
SRAT table.  Some other ACPI tables, such as NFIT and DMAR, also
contain proximity IDs for their device's NUMA topology.  This
information can be used to improve performance of these devices.

This patch introduces acpi_map_pxm_to_online_node(), which is
similar to acpi_map_pxm_to_node(), but always returns an online
node.  When the mapped node from a given proximity ID is offline,
it looks up the node distance table and returns the nearest
online node.

ACPI device drivers, which are called after the NUMA initialization
has completed in the kernel, can call this interface to obtain their
device NUMA topology from ACPI tables.  Such drivers do not have to
deal with offline nodes.  A node may be offline when a device
proximity ID is unique, SRAT memory entry does not exist, or NUMA is
disabled, ex. "numa=off" on x86.

This patch also moves the pxm range check from acpi_get_node() to
acpi_map_pxm_to_node().
Signed-off-by: NToshi Kani <toshi.kani@hp.com>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com&gt;>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

99759869

libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820

由 Dan Williams 提交于 6月 23, 2015

Upon detection of an unarmed dimm in a region, arrange for descendant
BTT, PMEM, or BLK instances to be read-only. A dimm is primarily marked
"unarmed" via flags passed by platform firmware (NFIT).

The flags in the NFIT memory device sub-structure indicate the state of
the data on the nvdimm relative to its energy source or last "flush to
persistence". For the most part there is nothing the driver can do but
advertise the state of these flags in sysfs and emit a message if
firmware indicates that the contents of the device may be corrupted.
However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
the block devices incorporating that nvdimm to be marked read-only.
This is a safe default as the data is still available and new writes are
held off until the administrator either forces read-write mode, or the
energy source becomes armed.

A 'read_only' attribute is added to REGION devices to allow for
overriding the default read-only policy of all descendant block devices.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

58138820

libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1

由 Ross Zwisler 提交于 6月 25, 2015

The libnvdimm implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave.  For this reason the libnvdimm generic nd_blk
driver calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
DCR (dimm control region), BDW (block data window), IDT (interleave
descriptor) NFIT structures and the hardware register format.
[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

047fc8a1

nd_btt: atomic sector updates · 5212e11f

由 Vishal Verma 提交于 6月 25, 2015

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked blocked device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the cpu number we're scheduled on to and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: NVishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

5212e11f

Revert "block, dm: don't copy bios for request clones" · 78d8e58a

由 Mike Snitzer 提交于 6月 26, 2015

This reverts commit 5f1b670d.

Justification for revert as reported in this dm-devel post:
https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

this change should not be pushed to mainline yet.

Firstly, Christoph has a newer version of the patch that fixes silent
data corruption problem:
  https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

And the new version still depends on LLDDs to always complete requests
to the end when error happens, while block API doesn't enforce such a
requirement. If the assumption is ever broken, the inconsistency between
request and bio (e.g. rq->__sector and rq->bio) will cause silent data
corruption:
  https://www.redhat.com/archives/dm-devel/2015-June/msg00022.htmlReported-by: NJunichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

78d8e58a

lib/string.c: introduce strreplace() · 94df2904

由 Rasmus Villemoes 提交于 6月 25, 2015

Strings are sometimes sanitized by replacing a certain character (often
'/') by another (often '!').  In a few places, this is done the same way
Schlemiel the Painter would do it.  Others are slightly smarter but still
do multiple strchr() calls.  Introduce strreplace() to do this using a
single function call and a single pass over the string.

One would expect the return value to be one of three things: void, s, or
the number of replacements made.  I chose the fourth, returning a pointer
to the end of the string.  This is more likely to be useful (for example
allowing the caller to avoid a strlen call).
Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

94df2904