Commits · 6dd53aa4563a2c69e80a24d2cc68d484b5ea2891 · OpenHarmony / kernel_linux

23 Jul, 2012 6 commits

deal with task_work callbacks adding more work · a2d4c71d

Committed by Al Viro on Jun 27, 2012

It doesn't matter on normal return to userland path (we'll recheck the
NOTIFY_RESUME flag anyway), but in case of exit_task_work() we'll
need that as soon as we get callbacks capable of triggering more
task_work_add().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a2d4c71d

move exit_task_work() past exit_files() et.al. · ed3e694d

Committed by Al Viro on Jun 27, 2012

... and get rid of PF_EXITING check in task_work_add().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ed3e694d

merge task_work and rcu_head, get rid of separate allocation for keyring case · 67d12145

Committed by Al Viro on Jun 27, 2012

task_work and rcu_head are identical now; merge them (calling the result
struct callback_head, rcu_head #define'd to it), kill separate allocation
in security/keys since we can just use cred->rcu now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

67d12145
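
The merged type described in 67d12145 can be sketched in userspace C. This is only an illustration under stated assumptions: `struct demo_cred`, `demo_release()` and the local `container_of()` macro are hypothetical stand-ins, not code from the commit.

```c
#include <assert.h>
#include <stddef.h>

/* One forward pointer plus one callback, as the commit message describes. */
struct callback_head {
    struct callback_head *next;
    void (*func)(struct callback_head *head);
};

/* The old name becomes an alias, so existing users keep compiling. */
#define rcu_head callback_head

/* Userspace stand-in for the kernel's container_of() helper. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical object that embeds the head instead of allocating one. */
struct demo_cred {
    int uid;
    int released;
    struct rcu_head rcu;    /* expands to struct callback_head */
};

/* The callback recovers the enclosing object from the embedded head. */
static void demo_release(struct callback_head *head)
{
    struct demo_cred *c = container_of(head, struct demo_cred, rcu);

    c->released = 1;
}
```

Because the head lives inside the object, queuing the callback needs no extra allocation, which is the point the commit makes about `cred->rcu`.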

trim task_work: get rid of hlist · 158e1645

Committed by Al Viro on Jun 27, 2012

layout based on Oleg's suggestion; single-linked list,
task->task_works points to the last element, forward pointer
from said last element points to head.  I'd still prefer
much more regular scheme with two pointers in task_work,
but...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

158e1645
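
The list layout from 158e1645 (one pointer to the last element, whose forward pointer leads back to the head) can be sketched as follows. This is a userspace illustration, not the kernel implementation; `struct work`, `work_add()`, `work_run()` and the recording callback are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task_work entry: a single forward pointer. */
struct work {
    struct work *next;
    void (*func)(struct work *);
};

/*
 * *last points to the newest element (NULL when the list is empty).
 * last->next points to the oldest element, so the list is circular and
 * both ends are reachable from one pointer.
 */
static void work_add(struct work **last, struct work *w)
{
    if (*last) {
        w->next = (*last)->next;    /* new tail points at the head */
        (*last)->next = w;          /* old tail links to the new tail */
    } else {
        w->next = w;                /* a single element points at itself */
    }
    *last = w;
}

/* Detach the whole list and run callbacks oldest first; returns the count. */
static int work_run(struct work **last)
{
    struct work *tail = *last, *w, *next;
    int n = 0;

    if (!tail)
        return 0;
    *last = NULL;
    w = tail->next;                 /* head of the list */
    tail->next = NULL;              /* break the circle */
    while (w) {
        next = w->next;             /* read before the callback may reuse w */
        w->func(w);
        n++;
        w = next;
    }
    return n;
}

/* Demo payload and callback that record the execution order. */
struct numbered {
    struct work w;                  /* first member, so the cast below is valid */
    int id;
};

static int order_log[8];
static int order_len;

static void record(struct work *w)
{
    order_log[order_len++] = ((struct numbered *)w)->id;
}
```

Entries run in the order they were added, and appending stays O(1) even though each entry carries only one pointer.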

trimming task_work: kill ->data · 41f9d29f

Committed by Al Viro on Jun 26, 2012

get rid of the only user of ->data; this is _not_ the final variant - in the
end we'll have task_work and rcu_head identical and just use cred->rcu,
at which point the separate allocation will be gone completely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

41f9d29f

signal: make sure we don't get stopped with pending task_work · 72667028

Committed by Al Viro on Jul 15, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

72667028

22 Jul, 2012 4 commits

kdb: Switch to nolock variants of kmsg_dump functions · c064da47

Committed by Anton Vorontsov on Jul 20, 2012

The locked variants are prone to deadlocks (suppose we got to the
debugger w/ the logbuf lock held), so let's switch to nolock variants.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c064da47

printk: Implement some unlocked kmsg_dump functions · 533827c9

Committed by Anton Vorontsov on Jul 20, 2012

If used from KDB, the locked variants are prone to deadlocks (suppose we
got to the debugger w/ the logbuf lock held).

So, we have to implement a few routines that grab no logbuf lock.

Yet we don't need these functions in modules, so we don't export them.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

533827c9
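
The locked/nolock split from 533827c9 follows a common pattern: the locked function is a thin wrapper around a body that takes no lock. A minimal userspace sketch, assuming a simulated lock flag in place of the real logbuf lock; `log_copy()`, `log_copy_nolock()` and `logbuf_locked` are hypothetical names:

```c
#include <assert.h>
#include <string.h>

static char log_buf[64] = "boot ok\n";
static int logbuf_locked;           /* 1 while the simulated lock is held */

/* Lock-free variant: the caller guarantees nobody else touches log_buf. */
static int log_copy_nolock(char *dst, size_t len)
{
    size_t n = strlen(log_buf);

    if (len == 0)
        return 0;
    if (n > len - 1)
        n = len - 1;                /* leave room for the terminator */
    memcpy(dst, log_buf, n);
    dst[n] = '\0';
    return (int)n;
}

/* Regular variant: takes the lock, then reuses the nolock body. */
static int log_copy(char *dst, size_t len)
{
    int n;

    if (logbuf_locked)
        return -1;                  /* a real spinlock would spin forever here */
    logbuf_locked = 1;
    n = log_copy_nolock(dst, len);
    logbuf_locked = 0;
    return n;
}
```

A debugger that stopped the system while the lock was held can still call the nolock variant, whereas the locked one could never make progress.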

printk: Remove kdb_syslog_data · 1b499d05

Committed by Anton Vorontsov on Jul 20, 2012

The function is no longer needed, so remove it.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1b499d05

kdb: Revive dmesg command · bc792e61

Committed by Anton Vorontsov on Jul 20, 2012

The kgdb dmesg command is broken after the printk rework.  The old logic
in kdb code makes no sense in terms of current printk/logging storage
format, and KDB simply hangs forever.

This patch revives the command by switching to kmsg_dumper iterator.

The code is now much simpler and shorter.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bc792e61

19 Jul, 2012 2 commits

Make wait_for_device_probe() also do scsi_complete_async_scans() · eea03c20

Committed by Linus Torvalds on Jul 18, 2012

Commit a7a20d10 ("sd: limit the scope of the async probe domain")
made the SCSI device probing run device discovery in its own async
domain.

However, as a result, the partition detection was no longer synchronized
by async_synchronize_full() (which, despite the name, only synchronizes
the global async space, not all of them).  Which in turn meant that
"wait_for_device_probe()" would not wait for the SCSI partitions to be
parsed.

And "wait_for_device_probe()" was what the boot time init code relied on
for mounting the root filesystem.

Now, most people never noticed this, because not only is it
timing-dependent, but modern distributions all use initrd.  So the root
filesystem isn't actually on a disk at all.  And then before they
actually mount the final disk filesystem, they will have loaded the
scsi-wait-scan module, which not only does the expected
wait_for_device_probe(), but also does scsi_complete_async_scans().

[ Side note: scsi_complete_async_scans() had also been partially broken,
  but that was fixed in commit 43a8d39d ("fix async probe
  regression"), so that same commit a7a20d10 had actually broken
  setups even if you used scsi-wait-scan explicitly ]

Solve this problem by just moving the scsi_complete_async_scans() call
into wait_for_device_probe().  Everybody who wants to wait for device
probing to finish really wants the SCSI probing to complete, so there's
no reason not to do this.

So now "wait_for_device_probe()" really does what the name implies, and
properly waits for device probing to finish.  This also removes the now
unnecessary extra calls to scsi_complete_async_scans().
Reported-and-tested-by: Artem S. Tashkinov <t.artem@mailcity.com>
Cc: Dan Williams <dan.j.williams@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: linux-scsi <linux-scsi@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eea03c20

PM / Sleep: Require CAP_BLOCK_SUSPEND to use wake_lock/wake_unlock · 11388c87

Committed by Rafael J. Wysocki on Jul 19, 2012

Require processes wanting to use the wake_lock/wake_unlock sysfs
files to have the CAP_BLOCK_SUSPEND capability, which also is
required for the eventpoll EPOLLWAKEUP flag to be effective, so that
all interfaces related to blocking autosleep depend on the same
capability.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
Acked-by: Michael Kerrisk <mtk.man-pages@gmail.com>

11388c87

17 Jul, 2012 1 commit

timekeeping: Add missing update call in timekeeping_resume() · 3e997130

Committed by Thomas Gleixner on Jul 16, 2012

The leap second rework unearthed another issue of inconsistent data.

On timekeeping_resume() the timekeeper data is updated, but nothing
calls timekeeping_update(), so now the update code in the timer
interrupt sees stale values.

This has been the case before those changes, but then the timer
interrupt was using stale data as well so this went unnoticed for quite
some time.

Add the missing update call, so all the data is consistent everywhere.
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Reported-and-tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Reported-and-tested-by: Martin Steigerwald <Martin@lichtvoll.de>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Linux PM list <linux-pm@vger.kernel.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3e997130

15 Jul, 2012 8 commits

time: Rework timekeeping functions to take timekeeper ptr as argument · f726a697

Committed by John Stultz on Jul 13, 2012

As part of cleaning up the timekeeping code, this patch converts
a number of internal functions to take a timekeeper ptr as an
argument, so that the internal functions don't access the global
timekeeper structure directly. This allows for further optimizations
to reduce lock hold time later.

This patch has been updated to include more consistent usage of the
timekeeper value, by making sure it is always passed as an argument
to non top-level functions.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-9-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

f726a697

time: Move xtime_nsec adjustment underflow handling timekeeping_adjust · 2a8c0883

Committed by John Stultz on Jul 13, 2012

When we make adjustments speeding up the clock, it's possible
for xtime_nsec to underflow. We already handle this properly,
but we do so from update_wall_time() instead of the more logical
timekeeping_adjust(), where the possible underflow actually
occurs.

Thus, move the correction logic to timekeeping_adjust(), which
is the function that causes the issue, making update_wall_time()
more readable.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-8-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2a8c0883

time: Move arch_gettimeoffset() usage into timekeeping_get_ns() · f2a5a085

Committed by John Stultz on Jul 13, 2012

Since we call arch_gettimeoffset() in all the accessor
functions, move arch_gettimeoffset() calls into
timekeeping_get_ns() and timekeeping_get_ns_raw() to simplify
the code.

This also makes the code easier to maintain as we don't have to
worry about forgetting the arch_gettimeoffset() as has happened
in the past.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-7-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

f2a5a085

time: Refactor accumulation of nsecs to secs · 1f4f9487

Committed by John Stultz on Jul 13, 2012

We do the exact same logic moving nsecs to secs in the
timekeeper in multiple places, so condense this into a
single function.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

1f4f9487
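
The "moving nsecs to secs" step that 1f4f9487 condenses into one helper might look roughly like this. It is a sketch under assumptions: `struct tk_demo` is a reduced, hypothetical timekeeper and `tk_accumulate_secs()` is an illustrative name, not necessarily the function the patch adds.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical, reduced timekeeper: whole seconds plus shifted nanoseconds. */
struct tk_demo {
    uint64_t xtime_sec;
    uint64_t xtime_nsec;    /* nanoseconds << shift */
    uint32_t shift;
};

/* Carry whole seconds out of xtime_nsec; returns how many were carried. */
static unsigned int tk_accumulate_secs(struct tk_demo *tk)
{
    uint64_t nsec_per_sec = NSEC_PER_SEC << tk->shift;
    unsigned int carried = 0;

    while (tk->xtime_nsec >= nsec_per_sec) {
        tk->xtime_nsec -= nsec_per_sec;
        tk->xtime_sec++;
        carried++;
    }
    return carried;
}
```

Keeping the carry in one place means every caller normalizes the pair the same way, including any bookkeeping that has to happen when a second rolls over.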

time: Condense timekeeper.xtime into xtime_sec · 1e75fa8b

Committed by John Stultz on Jul 13, 2012

The timekeeper struct has a xtime_nsec, which keeps the
sub-nanosecond remainder.  This ends up being somewhat
duplicative of the timekeeper.xtime.tv_nsec value, and we
have to do extra work to keep them apart, copying the full
nsec portion out and back in over and over.

This patch simplifies some of the logic by taking the timekeeper
xtime value and splitting it into timekeeper.xtime_sec and
reuses the timekeeper.xtime_nsec for the sub-second portion
(stored in higher res shifted nanoseconds).

This simplifies some of the accumulation logic and will
allow for more accurate timekeeping once the vsyscall code
is updated to use the shifted nanosecond remainder.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

1e75fa8b
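
The representation introduced by 1e75fa8b (whole seconds in one field, nanoseconds stored left-shifted so that sub-nanosecond bits survive) can be illustrated in plain C. The struct and helpers below are hypothetical and heavily reduced; only the field names follow the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced, hypothetical view of the two fields the commit describes. */
struct tk_time {
    uint64_t xtime_sec;
    uint64_t xtime_nsec;    /* (nanoseconds << shift), keeps sub-ns bits */
    uint32_t shift;
};

/* Store a (sec, nsec) pair in the split representation. */
static void tk_set(struct tk_time *tk, uint64_t sec, uint32_t nsec)
{
    tk->xtime_sec = sec;
    tk->xtime_nsec = (uint64_t)nsec << tk->shift;
}

/* Whole nanoseconds; the fractional bits stay behind in xtime_nsec. */
static uint32_t tk_nsec(const struct tk_time *tk)
{
    return (uint32_t)(tk->xtime_nsec >> tk->shift);
}
```

With a shift of 10, adding 512 to `xtime_nsec` is half a nanosecond: one such step leaves `tk_nsec()` unchanged, and a second one advances it by a full nanosecond, so the remainder is never copied out and back in.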

time: Explicitly use u32 instead of int for shift values · fee84c43

Committed by John Stultz on Jul 13, 2012

Ingo noted that using a u32 instead of int for shift values
would be better to make sure the compiler doesn't unnecessarily
use complex signed arithmetic.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

fee84c43

time: Whitespace cleanups per Ingo's requests · 42e71e81

Committed by John Stultz on Jul 13, 2012

Ingo noted a number of places where there is inconsistent
use of whitespace. This patch tries to address the main
culprits.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

42e71e81

ntp: Fix STA_INS/DEL clearing bug · 6b1859db

Committed by John Stultz on Jul 13, 2012

In commit 6b43ae8a, I
introduced a bug that kept the STA_INS or STA_DEL bit
from being cleared from time_status via adjtimex()
without forcing STA_PLL first.

Usually once the STA_INS is set, it isn't cleared
until the leap second is applied, so it's unlikely this
affected anyone. However during testing I noticed it
took some effort to cancel a leap second once STA_INS
was set.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org # 3.4
Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

6b1859db

14 Jul, 2012 4 commits

VFS: Pass mount flags to sget() · 9249e17f

Committed by David Howells on Jun 25, 2012

Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called.  They could also be passed to the
compare function.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9249e17f

VFS: Make clone_mnt()/copy_tree()/collect_mounts() return errors · be34d1a3

Committed by David Howells on Jun 25, 2012

copy_tree() can theoretically fail in a case other than ENOMEM, but always
returns NULL which is interpreted by callers as -ENOMEM.  Change it to return
an explicit error.

Also change clone_mnt() for consistency and because union mounts will add new
error cases.

Thanks to Andreas Gruenbacher <agruen@suse.de> for a bug fix.
[AV: folded braino fix by Dan Carpenter]

Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Valerie Aurora <valerie.aurora@gmail.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

be34d1a3
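
Returning an explicit error instead of NULL, as be34d1a3 describes, is usually done in the kernel by encoding a negative errno in the pointer value. The sketch below re-creates that idea in userspace: `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()` here are local stand-ins modelled on the kernel helpers of the same names, and `struct mnt_demo` and `clone_demo()` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins: a small negative errno is stored in the pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    /* The top MAX_ERRNO addresses are reserved for error codes. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct mnt_demo {
    int flags;
};

/* Hypothetical clone: tells "not allowed" apart from "out of memory". */
static struct mnt_demo *clone_demo(const struct mnt_demo *old, int allowed)
{
    struct mnt_demo *m;

    if (!allowed)
        return ERR_PTR(-EINVAL);
    m = malloc(sizeof(*m));
    if (!m)
        return ERR_PTR(-ENOMEM);
    *m = *old;
    return m;
}
```

A caller that previously treated every NULL as -ENOMEM can now forward `PTR_ERR(m)` unchanged, which is what makes new failure cases possible.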

get rid of kern_path_parent() · 79714f72

Committed by Al Viro on Jun 15, 2012

all callers want the same thing, actually - a kinda-sorta analog of
kern_path_create().  I.e. they want parent vfsmount/dentry (with
->i_mutex held, to make sure the child dentry is still their child)
+ the child dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

79714f72

stop passing nameidata to ->lookup() · 00cd8dd3

Committed by Al Viro on Jun 10, 2012

Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

00cd8dd3

12 Jul, 2012 8 commits

tracing: Check for allocation failure in __tracing_open() · 93574fcc

Committed by Dan Carpenter on Jul 11, 2012

Clean up and return -ENOMEM if the kzalloc() fails.

This also prevents a potential crash, as the pointer that failed to
allocate would be later used.

Link: http://lkml.kernel.org/r/20120711063507.GF11812@elgon.mountain

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

93574fcc
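
The pattern behind 93574fcc (check each allocation, undo earlier ones, return -ENOMEM) can be shown with ordinary `calloc()`. This is a userspace sketch, not the tracing code: `struct iter_demo`, `iter_open_demo()` and the injectable allocator are hypothetical and exist only so the failure path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct iter_demo {
    int *buffer;
    int pos;
};

/* Allocator is passed in so a test can force a failure. */
static int iter_open_demo(struct iter_demo **out,
                          void *(*alloc_fn)(size_t, size_t))
{
    struct iter_demo *it = alloc_fn(1, sizeof(*it));

    if (!it)
        return -ENOMEM;
    it->buffer = alloc_fn(16, sizeof(*it->buffer));
    if (!it->buffer) {
        free(it);           /* clean up the part that did succeed */
        return -ENOMEM;     /* never hand back a half-built object */
    }
    *out = it;
    return 0;
}

/* Test helper: succeeds once, then fails on the second call. */
static int alloc_calls;

static void *fail_second_alloc(size_t n, size_t size)
{
    return ++alloc_calls == 2 ? NULL : calloc(n, size);
}
```

Without the check, the NULL `buffer` would be dereferenced later, far from the place where the allocation actually failed.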

c/r: prctl: less paranoid prctl_set_mm_exe_file() · 4229fb1d

Committed by Konstantin Khlebnikov on Jul 11, 2012

The "no other files mapped" requirement from my previous patch (c/r: prctl:
update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
paranoid: it forbids the operation even if only one shared-anon vma is mapped.

Let's check instead that the current mm->exe_file is already unmapped; in this
case the exe_file symlink is already outdated and changing it is reasonable.

Plus, this patch fixes the exit code in case the operation succeeds.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4229fb1d

hrtimer: Update hrtimer base offsets each hrtimer_interrupt · 5baefd6d

Committed by John Stultz on Jul 10, 2012

The update of the hrtimer base offsets on all cpus cannot be made
atomically from the timekeeper.lock held and interrupt disabled region
as smp function calls are not allowed there.

clock_was_set(), which enforces the update on all cpus, is called
either from preemptible process context in case of do_settimeofday()
or from the softirq context when the offset modification happened in
the timer interrupt itself due to a leap second.

In both cases there is a race window for an hrtimer interrupt between
dropping timekeeper lock, enabling interrupts and clock_was_set()
issuing the updates. Any interrupt which arrives in that window will
see the new time but operate on stale offsets.

So we need to make sure that an hrtimer interrupt always sees a
consistent state of time and offsets.

ktime_get_update_offsets() allows us to get the current monotonic time
and update the per cpu hrtimer base offsets from hrtimer_interrupt()
to capture a consistent state of monotonic time and the offsets. The
function replaces the existing ktime_get() calls in hrtimer_interrupt().

The overhead of the new function vs. ktime_get() is minimal as it just
adds two store operations.

This ensures that any changes to realtime or boottime offsets are
noticed and stored into the per-cpu hrtimer base structures, prior to
any hrtimer expiration and guarantees that timers are not expired early.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

5baefd6d
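
Why 5baefd6d reads the time and refreshes the offsets in one step can be seen in a small model. The sketch below is single-threaded and omits all locking; the struct names, `get_update_offsets()` and `realtime_expired()` are hypothetical simplifications, not the kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical global clock state. */
struct timekeeper_demo {
    int64_t mono_ns;        /* monotonic time */
    int64_t offs_real_ns;   /* realtime = monotonic + this offset */
};

/* Hypothetical per-CPU copy of the offset used when expiring timers. */
struct cpu_base_demo {
    int64_t offs_real_ns;
};

/*
 * Return the monotonic time and refresh the cached offset together.
 * In real code both reads would sit in one locked or sequence-counted
 * section so that they describe the same instant.
 */
static int64_t get_update_offsets(const struct timekeeper_demo *tk,
                                  struct cpu_base_demo *base)
{
    base->offs_real_ns = tk->offs_real_ns;
    return tk->mono_ns;
}

/* A realtime timer expires once monotonic + offset reaches its deadline. */
static int realtime_expired(int64_t now_mono,
                            const struct cpu_base_demo *base,
                            int64_t deadline_real)
{
    return now_mono + base->offs_real_ns >= deadline_real;
}
```

If the clock was stepped back (offset 1000 became 900) but a CPU still holds the old offset, a timer due at realtime 1150 appears expired at monotonic 150; after the refresh it correctly is not.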

timekeeping: Provide hrtimer update function · f6c06abf

Committed by Thomas Gleixner on Jul 10, 2012

To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

f6c06abf

hrtimers: Move lock held region in hrtimer_interrupt() · 196951e9

Committed by Thomas Gleixner on Jul 10, 2012

We need to update the base offsets from this code and we need to do
that under base->lock. Move the lock held region around the
ktime_get() calls. The ktime_get() calls are going to be replaced with
a function which gets the time and the offsets atomically.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

196951e9

timekeeping: Maintain ktime_t based offsets for hrtimers · 5b9fe759

Committed by Thomas Gleixner on Jul 10, 2012

We need to update the hrtimer clock offsets from the hrtimer interrupt
context. To avoid conversions from timespec to ktime_t maintain a
ktime_t based representation of those offsets in the timekeeper. This
puts the conversion overhead into the code which updates the
underlying offsets and provides fast accessible values in the hrtimer
interrupt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

5b9fe759

timekeeping: Fix leapsecond triggered load spike issue · 4873fa07

Committed by John Stultz on Jul 10, 2012

The timekeeping code misses an update of the hrtimer subsystem after a
leap second happened. Due to that timers based on CLOCK_REALTIME are
either expiring a second early or late depending on whether a leap
second has been inserted or deleted until an operation is initiated
which causes that update. Unless the update happens by some other
means this discrepancy between the timekeeping and the hrtimer data
stays forever and timers are expired either early or late.

The reported immediate workaround - $ date -s "`date`" - is causing a
call to clock_was_set() which updates the hrtimer data structures.
See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix

Add the missing clock_was_set() call to update_wall_time() in case of
a leap second event. The actual update is deferred to softirq context
as the necessary smp function call cannot be invoked from hard
interrupt context.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

4873fa07

hrtimer: Provide clock_was_set_delayed() · f55a6faa

Committed by John Stultz on Jul 10, 2012

clock_was_set() cannot be called from hard interrupt context because
it calls on_each_cpu().

For fixing the widely reported leap seconds issue it is necessary to
call it from hard interrupt context, i.e. the timer tick code, which
does the timekeeping updates.

Provide a new function which notes the pending update in the hrtimer cpu
base structure of the cpu on which it is called and raises the hrtimer
softirq. We then execute the clock_was_set() notification from
softirq context in run_hrtimer_softirq(). The hrtimer softirq is
rarely used, so polling the flag there is not a performance issue.

[ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get
  rid of all this ifdeffery ASAP ]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

f55a6faa
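
The deferral that f55a6faa describes (set a flag where heavy work is not allowed, do the work later from a context where it is) can be modelled in a few lines. Everything below is a hypothetical userspace model: the `_demo` functions and the two flags only mimic the roles of the hard interrupt path, the softirq and `clock_was_set()`.

```c
#include <assert.h>

/* Hypothetical per-CPU state. */
struct cpu_base_demo {
    int clock_was_set;      /* request recorded by the "hard irq" path */
    int softirq_pending;    /* the deferred handler has been raised */
};

static int updates_done;

/* Stands in for the expensive cross-CPU update. */
static void clock_was_set_demo(void)
{
    updates_done++;
}

/* Cheap enough for "hard interrupt" context: it only sets flags. */
static void clock_was_set_delayed_demo(struct cpu_base_demo *base)
{
    base->clock_was_set = 1;
    base->softirq_pending = 1;
}

/* Deferred handler: polls the flag and performs the real update. */
static void run_softirq_demo(struct cpu_base_demo *base)
{
    base->softirq_pending = 0;
    if (base->clock_was_set) {
        base->clock_was_set = 0;
        clock_was_set_demo();
    }
}
```

The flag check costs one load per handler run, which matches the commit's remark that polling it is not a performance issue.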

10 Jul, 2012 2 commits

kmsg: merge continuation records while printing · 5becfb1d

Committed by Kay Sievers on Jul 09, 2012

In (the unlikely) case our continuation merge buffer is busy, we unfortunately
can not merge further continuation printk()s into a single record and have to
store them separately, which leads to split-up output of these lines when they
are printed.

Add some flags about newlines and prefix existence to these records and try to
reconstruct the full line again, when the separated records are printed.
Reported-By: Michael Neuling <mikey@neuling.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Tested-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

5becfb1d

kmsg: /proc/kmsg - support reading of partial log records · eb02dac9

Committed by Kay Sievers on Jul 09, 2012

Restore support for partial reads of any size on /proc/kmsg, in case the
supplied read buffer is smaller than the record size.

Some people seem to think it is a good idea to run:
  $ dd if=/proc/kmsg bs=1 of=...
as a klog bridge.

Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44211
Reported-by: Jukka Ollila <jiiksteri@gmail.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

eb02dac9
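
Supporting reads smaller than a record, as eb02dac9 restores, comes down to remembering how much of the current record has already been handed out. A minimal sketch, assuming a single in-memory record; `struct kmsg_reader` and `read_partial()` are hypothetical names:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical reader state: the current record and bytes already returned. */
struct kmsg_reader {
    const char *record;
    size_t partial;
};

/* Copy up to len bytes of the record; returns bytes copied, 0 when done. */
static size_t read_partial(struct kmsg_reader *r, char *buf, size_t len)
{
    size_t total = strlen(r->record);
    size_t left = total - r->partial;
    size_t n = left < len ? left : len;

    memcpy(buf, r->record + r->partial, n);
    r->partial += n;        /* the next call continues where this one stopped */
    return n;
}
```

With this bookkeeping, the `bs=1` case from the commit message simply takes one call per byte and still reassembles the full record.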

08 Jul, 2012 2 commits

cgroup: fix cgroup hierarchy umount race · 5db9a4d9

Committed by Tejun Heo on Jul 07, 2012

48ddbe19 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed.  As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

Destroying a superblock which has dentries with positive refcnts is a
critical bug and triggers BUG() in vfs code.  As each cgroup dentry
holds an s_active reference, any lingering cgroup has both its dentry
and the superblock pinned and thus preventing premature release of
superblock.

Unfortunately, after 48ddbe19, there's a small window while
releasing a cgroup which is directly under the root of the hierarchy.
When a cgroup directory is released, vfs layer first deletes the
corresponding dentry and then invokes dput() on the parent, which may
recurse further, so when a cgroup directly below root cgroup is
released, the cgroup is first destroyed - which releases the s_active
it was holding - and then the dentry for the root cgroup is dput().

This creates a window where the root dentry's refcnt isn't zero but
superblock's s_active is.  If umount happens before or during this
window, vfs will see the root dentry with non-zero refcnt and trigger
BUG().

Before 48ddbe19, this problem didn't exist because the last dentry
reference was guaranteed to be put synchronously from rmdir(2)
invocation which holds s_active around the whole process.

Fix it by holding an extra superblock->s_active reference across
dput() from css release, which is the dput() path added by 48ddbe19
and the only one which doesn't hold an extra s_active ref across the
final cgroup dput().
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: shyju pv <shyju.pv@huawei.com>
Tested-by: shyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>

5db9a4d9

Revert "cgroup: superblock can't be released with active dentries" · 7db5b3ca

Committed by Tejun Heo on Jul 07, 2012

This reverts commit fa980ca8.  The
commit was an attempt to fix a race condition where a cgroup hierarchy
may be unmounted with positive dentry reference on root cgroup.  While
the commit made the race condition slightly more difficult to trigger,
the race was still there and could be reliably triggered using a
different test case.

Revert the incorrect fix.  The next commit will describe the race and
fix it correctly.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: shyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>

7db5b3ca

07 Jul, 2012 3 commits

kmsg: make sure all messages reach a newly registered boot console · 68b6507d

Committed by Kay Sievers on Jul 06, 2012

We suppress printing kmsg records to the console, which are already printed
immediately while we have received their fragments.

Newly registered boot consoles print the entire kmsg buffer during
registration. Clear the console-suppress flag after we skipped the record
during its first storage, so any later print will see these records as usual.
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

68b6507d

kmsg: properly handle concurrent non-blocking read() from /proc/kmsg · cb424ffe

Committed by Kay Sievers on Jul 06, 2012

The /proc/kmsg read() interface is internally simply wired up to a sequence
of syslog() syscalls, which are racy between their checks and actions with
regard to concurrency.

In the (very uncommon) case of concurrent readers of /dev/kmsg, relying on
usual O_NONBLOCK behavior, the recently introduced mutex might block an
O_NONBLOCK reader in read(), when poll() returns for it, but another process
has already read the data in the meantime. We've seen that while running
artificial test setups and tools that "fight" about /proc/kmsg data.

This restores the original /proc/kmsg behavior, where in case of concurrent
read()s, poll() might wake up but the read() syscall will just return 0 to
the caller, while another process has "stolen" the data.

This is in the general case not the expected behavior, but it is the exact
same one, that can easily be triggered with a 3.4 kernel, and some tools
might just rely on it.

The mutex is not needed, the original integrity issue which introduced it,
is in the meantime covered by:
  "fill buffer with more than a single message for SYSLOG_ACTION_READ"
  116e90b2

Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

cb424ffe

kmsg: add the facility number to the syslog prefix · 43a73a50

Committed by Kay Sievers on Jul 06, 2012

After the recent split of facility and level into separate variables,
we miss the facility value (always 0 for kernel-originated messages)
in the syslog prefix.

On Tue, Jul 3, 2012 at 12:45 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> Static checkers complain about the impossible condition here.
>
> In 084681d1 ('printk: flush continuation lines immediately to
> console'), we changed msg->level from being a u16 to being an unsigned
> 3 bit bitfield.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

43a73a50
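
The syslog prefix that 43a73a50 completes combines the facility and the level into one number: the facility shifted left by three bits, OR-ed with the 3-bit level. A small sketch of that arithmetic; `syslog_prefix()` is a hypothetical helper name, not the kernel function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format "<N>" where N = (facility << 3) | level; returns the length. */
static int syslog_prefix(char *buf, size_t len,
                         unsigned int facility, unsigned int level)
{
    unsigned int prio = (facility << 3) | (level & 7);

    return snprintf(buf, len, "<%u>", prio);
}
```

For kernel-originated messages the facility is 0, so level 6 still prints as `<6>`; a message with facility 1 and level 5 prints as `<13>`.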

OpenHarmony / kernel_linux last synced more than 4 years ago
