提交 · a9327cac440be4d8333bba975cbbf76045096275 · openeuler / Kernel

14 9月, 2009 1 次提交

Seperate read and write statistics of in_flight requests · a9327cac

由 Nikanth Karthikesan 提交于 9月 11, 2009

Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.

This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.
Signed-off-by: NNikanth Karthikesan <knikanth@suse.de>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

a9327cac

11 9月, 2009 8 次提交

block: enable rq CPU completion affinity by default · 01e97f6b

由 Jens Axboe 提交于 9月 03, 2009

Test results here look good, and on big OLTP runs it has also shown
to significantly increase cycles attributed to the database and
cause a performance boost.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

01e97f6b

block: make blk_iopoll_prep_sched() follow normal 0/1 return convention · d62f843b

由 Jens Axboe 提交于 8月 11, 2009

Return 0 if we successfully marked this iopoll structure as ours for
scheduling, instead of 1.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

d62f843b

block: add blk-iopoll, a NAPI like approach for block devices · 5e605b64

由 Jens Axboe 提交于 8月 05, 2009

This borrows some code from NAPI and implements a polled completion
mode for block devices. The idea is the same as NAPI - instead of
doing the command completion when the irq occurs, schedule a dedicated
softirq in the hopes that we will complete more IO when the iopoll
handler is invoked. Devices have a budget of commands assigned, and will
stay in polled mode as long as they continue to consume their budget
from the iopoll softirq handler. If they do not, the device is set back
to interrupt completion mode.

This patch holds the core bits for blk-iopoll, device driver support
sold separately.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

5e605b64

block: improve queue_should_plug() by looking at IO depths · fb1e7538

由 Jens Axboe 提交于 7月 30, 2009

Instead of just checking whether this device uses block layer
tagging, we can improve the detection by looking at the maximum
queue depth it has reached. If that crosses 4, then deem it a
queuing device.

This is important on high IOPS devices, since plugging hurts
the performance there (it can be as much as 10-15% of the sys
time).
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

fb1e7538

bio: first step in sanitizing the bio->bi_rw flag testing · 1f98a13f

由 Jens Axboe 提交于 9月 11, 2009

Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

1f98a13f

block: make bio_rw_flagged() return a bool · e7e503ae

由 Jens Axboe 提交于 7月 28, 2009

Makes for a saner interface, instead of returning the bit position.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

e7e503ae

block: implement mixed merge of different failfast requests · 80a761fd

由 Tejun Heo 提交于 7月 03, 2009

Failfast has characteristics from other attributes.  When issuing,
executing and successuflly completing requests, failfast doesn't make
any difference.  It only affects how a request is handled on failure.
Allowing requests with different failfast settings to be merged cause
normal IOs to fail prematurely while not allowing has performance
penalties as failfast is used for read aheads which are likely to be
located near in-flight or to-be-issued normal IOs.

This patch introduces the concept of 'mixed merge'.  A request is a
mixed merge if it is merge of segments which require different
handling on failure.  Currently the only mixable attributes are
failfast ones (or lack thereof).

When a bio with different failfast settings is added to an existing
request or requests of different failfast settings are merged, the
merged request is marked mixed.  Each bio carries failfast settings
and the request always tracks failfast state of the first bio.  When
the request fails, blk_rq_err_bytes() can be used to determine how
many bytes can be safely failed without crossing into an area which
requires further retrials.

This allows request merging regardless of failfast settings while
keeping the failure handling correct.

This patch only implements mixed merge but doesn't enable it.  The
next one will update SCSI to make use of mixed merge.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

80a761fd

block: use the same failfast bits for bio and request · a82afdfc

由 Tejun Heo 提交于 7月 03, 2009

bio and request use the same set of failfast bits.  This patch makes
the following changes to simplify things.

* enumify BIO_RW* bits and reorder bits such that BIOS_RW_FAILFAST_*
  bits coincide with __REQ_FAILFAST_* bits.

* The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
  but the matching is useless anyway.  init_request_from_bio() is
  responsible for setting FAILFAST bits on FS requests and non-FS
  requests never use BIO_RW_AHEAD.  Drop the code and comment from
  blk_rq_bio_prep().

* Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and
  simplify FAILFAST flags handling in init_request_from_bio().
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

a82afdfc

09 9月, 2009 2 次提交

shmfs: use 'check_acl' instead of 'permission' · 6d848a48

由 Linus Torvalds 提交于 8月 28, 2009

shmfs wants purely standard POSIX ACL semantics, so we can use the new
generic VFS layer POSIX ACL checking rather than cooking our own
'permission()' function.
Reviewed-by: NJames Morris <jmorris@namei.org>
Acked-by: NSerge Hallyn <serue@us.ibm.com>
Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6d848a48

Make 'check_acl()' a first-class filesystem op · 5909ccaa

由 Linus Torvalds 提交于 8月 28, 2009

This is stage one in flattening out the callchains for the common
permission testing. Rather than have most filesystem implement their
own inode->i_op->permission function that just calls back down to the
VFS layers 'generic_permission()' with the per-filesystem ACL checking
function, the filesystem can just expose its 'check_acl' function
directly, and let the VFS layer do everything for it.

This is all just preparatory - no filesystem actually enables this yet.
Reviewed-by: NJames Morris <jmorris@namei.org>
Acked-by: NSerge Hallyn <serue@us.ibm.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5909ccaa

06 9月, 2009 2 次提交

exec: do not sleep in TASK_TRACED under ->cred_guard_mutex · a2a8474c

由 Oleg Nesterov 提交于 9月 05, 2009

Tom Horsley reports that his debugger hangs when it tries to read
/proc/pid_of_tracee/maps, this happens since

	"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
	04b836cbf19e885f8366bccb2e4b0474346c02d

commit in 2.6.31.

But the root of the problem lies in the fact that do_execve() path calls
tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.

The tracee must not sleep in TASK_TRACED holding this mutex.  Even if we
remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
another task doing PTRACE_ATTACH should not hang until it is killed or the
tracee resumes.

With this patch do_execve() does not use ->cred_guard_mutex directly and
we do not hold it throughout, instead:

	- introduce prepare_bprm_creds() helper, it locks the mutex
	  and calls prepare_exec_creds() to initialize bprm->cred.

	- install_exec_creds() drops the mutex after commit_creds(),
	  and thus before tracehook_report_exec()->ptrace_stop().

	  or, if exec fails,

	  free_bprm() drops this mutex when bprm->cred != NULL which
	  indicates install_exec_creds() was not called.
Reported-by: NTom Horsley <tom.horsley@att.net>
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NDavid Howells <dhowells@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a2a8474c

workqueues: introduce __cancel_delayed_work() · 4e49627b

由 Oleg Nesterov 提交于 9月 05, 2009

cancel_delayed_work() has to use del_timer_sync() to guarantee the timer
function is not running after return.  But most users doesn't actually
need this, and del_timer_sync() has problems: it is not useable from
interrupt, and it depends on every lock which could be taken from irq.

Introduce __cancel_delayed_work() which calls del_timer() instead.

The immediate reason for this patch is
http://bugzilla.kernel.org/show_bug.cgi?id=13757
but hopefully this helper makes sense anyway.

As for 13757 bug, actually we need requeue_delayed_work(), but its
semantics are not yet clear.

Merge this patch early to resolves cross-tree interdependencies between
input and infiniband.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e49627b

05 9月, 2009 2 次提交

dm log: userspace add luid to distinguish between concurrent log instances · 7ec23d50

由 Jonathan Brassow 提交于 9月 04, 2009

Device-mapper userspace logs (like the clustered log) are
identified by a universally unique identifier (UUID).  This
identifier is used to associate requests from the kernel to
a specific log in userspace.  The UUID must be unique everywhere,
since multiple machines may use this identifier when communicating
about a particular log, as is the case for cluster logs.

Sometimes, device-mapper/LVM may re-use a UUID.  This is the
case during pvmoves, when moving from one segment of an LV
to another, or when resizing a mirror, etc.  In these cases,
a new log is created with the same UUID and loaded in the
"inactive" slot.  When a device-mapper "resume" is issued,
the "live" table is deactivated and the new "inactive" table
becomes "live".  (The "inactive" table can also be removed
via a device-mapper 'clear' command.)

The above two issues were colliding.  More than one log was being
created with the same UUID, and there was no way to distinguish
between them.  So, sometimes the wrong log would be swapped
out during the exchange.

The solution is to create a locally unique identifier,
'luid', to go along with the UUID.  This new identifier is used
to determine exactly which log is being referenced by the kernel
when the log exchange is made.  The identifier is not
universally safe, but it does not need to be, since
create/destroy/suspend/resume operations are bound to a specific
machine; and these are the operations that make up the exchange.
Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

7ec23d50

dm stripe: expose correct io hints · 40bea431

由 Mike Snitzer 提交于 9月 04, 2009

Set sensible I/O hints for striped DM devices in the topology
infrastructure added for 2.6.31 for userspace tools to
obtain via sysfs.

Add .io_hints to 'struct target_type' to allow the I/O hints portion
(io_min and io_opt) of the 'struct queue_limits' to be set by each
target and implement this for dm-stripe.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

40bea431

01 9月, 2009 1 次提交

lmb: Also remove __init from lmb_end_of_RAM() declaration in lmb.h · 1a37f184

由 Benjamin Herrenschmidt 提交于 8月 31, 2009

My previous patch (commit 4f8ee2c9: "lmb: Remove __init from
lmb_end_of_DRAM()") removed __init in lmb.c but missed the fact that it
was also marked as such in the .h
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1a37f184

27 8月, 2009 2 次提交

flex_array: convert element_nr formals to unsigned · b62e408c

由 David Rientjes 提交于 8月 26, 2009

It's problematic to allow signed element_nr's or total's to be passed as
part of the flex array API.

flex_array_alloc() allows total_nr_elements to be set to a negative
quantity, which is obviously erroneous.

flex_array_get() and flex_array_put() allows negative array indices in
dereferencing an array part, which could address memory mapped before
struct flex_array.

The fix is to convert all existing element_nr formals to be qualified as
unsigned.  Existing checks to compare it to total_nr_elements or the max
array size based on element_size need not be changed.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b62e408c

flex_array: declare parts member to have incomplete type · 8e7ee270

由 David Rientjes 提交于 8月 26, 2009

The `parts' member of struct flex_array should evaluate to an incomplete
type so that sizeof() cannot be used and C99 does not require the
zero-length specification.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NDave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8e7ee270

25 8月, 2009 1 次提交

mm: fix hugetlb bug due to user_shm_unlock call · 353d5c30

由 Hugh Dickins 提交于 8月 24, 2009

2.6.30's commit 8a0bdec1 removed
user_shm_lock() calls in hugetlb_file_setup() but left the
user_shm_unlock call in shm_destroy().

In detail:
Assume that can_do_hugetlb_shm() returns true and hence user_shm_lock()
is not called in hugetlb_file_setup(). However, user_shm_unlock() is
called in any case in shm_destroy() and in the following
atomic_dec_and_lock(&up->__count) in free_uid() is executed and if
up->__count gets zero, also cleanup_user_struct() is scheduled.

Note that sched_destroy_user() is empty if CONFIG_USER_SCHED is not set.
However, the ref counter up->__count gets unexpectedly non-positive and
the corresponding structs are freed even though there are live
references to them, resulting in a kernel oops after a lots of
shmget(SHM_HUGETLB)/shmctl(IPC_RMID) cycles and CONFIG_USER_SCHED set.

Hugh changed Stefan's suggested patch: can_do_hugetlb_shm() at the
time of shm_destroy() may give a different answer from at the time
of hugetlb_file_setup().  And fixed newseg()'s no_id error path,
which has missed user_shm_unlock() ever since it came in 2.6.9.
Reported-by: NStefan Huber <shuber2@gmail.com>
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Tested-by: NStefan Huber <shuber2@gmail.com>
Cc: stable@kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

353d5c30

24 8月, 2009 1 次提交

kernel_read: redefine offset type · 6777d773

由 Mimi Zohar 提交于 8月 21, 2009

vfs_read() offset is defined as loff_t, but kernel_read()
offset is only defined as unsigned long. Redefine
kernel_read() offset as loff_t.

Cc: stable@kernel.org
Signed-off-by: NMimi Zohar <zohar@us.ibm.com>
Signed-off-by: NJames Morris <jmorris@namei.org>

6777d773

22 8月, 2009 1 次提交

Make bitmask 'and' operators return a result code · f4b0373b

由 Linus Torvalds 提交于 8月 21, 2009

When 'and'ing two bitmasks (where 'andnot' is a variation on it), some
cases want to know whether the result is the empty set or not. In
particular, the TLB IPI sending code wants to do cpumask operations and
determine if there are any CPU's left in the final set.

So this just makes the bitmask (and cpumask) functions return a boolean
for whether the result has any bits set.

Cc: stable@kernel.org (2.6.30, needed by TLB shootdown fix)
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f4b0373b

21 8月, 2009 1 次提交

Input: ucb1400_ts - enable ADC Filter · 1700f5fd

由 Marek Vasut 提交于 8月 20, 2009

This patch enables ADC filtering on UCB1400 codec by default. The
benefit from this change is mostly on some Colibri boards where
the ADCSYNC pin of the UCB1400 codec isn't connected causing the
touchscreen to jitter very badly. This change has no visible
effect on boards where the ADCSYNC pin is connected.
Signed-off-by: NMarek Vasut <marek.vasut@gmail.com>
Tested-by: NPalo Revak <palo@bielyvlk.sk>
Signed-off-by: NDmitry Torokhov <dtor@mail.ru>

1700f5fd

19 8月, 2009 1 次提交

mm: revert "oom: move oom_adj value" · 0753ba01

由 KOSAKI Motohiro 提交于 8月 18, 2009

The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
the mm_struct.  It was a very good first step for sanitize OOM.

However Paul Menage reported the commit makes regression to his job
scheduler.  Current OOM logic can kill OOM_DISABLED process.

Why? His program has the code of similar to the following.

	...
	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
	...
	if (vfork() == 0) {
		set_oom_adj(0); /* Invoked child can be killed */
		execve("foo-bar-cmd");
	}
	....

vfork() parent and child are shared the same mm_struct.  then above
set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
change oom_adj for vfork() parent.  Then, vfork() parent (job scheduler)
lost OOM immune and it was killed.

Actually, fork-setting-exec idiom is very frequently used in userland program.
We must not break this assumption.

Then, this patch revert commit 2ff05b2b and related commit.

Reverted commit list
---------------------
- commit 2ff05b2b (oom: move oom_adj value from task_struct to mm_struct)
- commit 4d8b9135 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
- commit 81236810 (oom: only oom kill exiting tasks with attached memory)
- commit 933b787b (mm: copy over oom_adj value at fork time)
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0753ba01

18 8月, 2009 1 次提交

net: restore gnet_stats_basic to previous definition · c1a8f1f1

由 Eric Dumazet 提交于 8月 16, 2009

In 5e140dfc "net: reorder struct Qdisc
for better SMP performance" the definition of struct gnet_stats_basic
changed incompatibly, as copies of this struct are shipped to
userland via netlink.

Restoring old behavior is not welcome, for performance reason.

Fix is to use a private structure for kernel, and
teach gnet_stats_copy_basic() to convert from kernel to user land,
using legacy structure (struct gnet_stats_basic)

Based on a report and initial patch from Michael Spang.
Reported-by: NMichael Spang <mspang@csclub.uwaterloo.ca>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c1a8f1f1

17 8月, 2009 3 次提交

security: define round_hint_to_min in !CONFIG_SECURITY · 1d995973

由 Eric Paris 提交于 8月 07, 2009

Fix the header files to define round_hint_to_min() and to define
mmap_min_addr_handler() in the !CONFIG_SECURITY case.

Built and tested with !CONFIG_SECURITY
Signed-off-by: NEric Paris <eparis@redhat.com>
Signed-off-by: NJames Morris <jmorris@namei.org>

1d995973

Security/SELinux: seperate lsm specific mmap_min_addr · 788084ab

由 Eric Paris 提交于 7月 31, 2009

Currently SELinux enforcement of controls on the ability to map low memory
is determined by the mmap_min_addr tunable.  This patch causes SELinux to
ignore the tunable and instead use a seperate Kconfig option specific to how
much space the LSM should protect.

The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
permissions will always protect the amount of low memory designated by
CONFIG_LSM_MMAP_MIN_ADDR.

This allows users who need to disable the mmap_min_addr controls (usual reason
being they run WINE as a non-root user) to do so and still have SELinux
controls preventing confined domains (like a web server) from being able to
map some area of low memory.
Signed-off-by: NEric Paris <eparis@redhat.com>
Signed-off-by: NJames Morris <jmorris@namei.org>

788084ab

Capabilities: move cap_file_mmap to commoncap.c · 9c0d9010

由 Eric Paris 提交于 7月 31, 2009

Currently we duplicate the mmap_min_addr test in cap_file_mmap and in
security_file_mmap if !CONFIG_SECURITY.  This patch moves cap_file_mmap
into commoncap.c and then calls that function directly from
security_file_mmap ifndef CONFIG_SECURITY like all of the other capability
checks are done.
Signed-off-by: NEric Paris <eparis@redhat.com>
Acked-by: NSerge Hallyn <serue@us.ibm.com>
Signed-off-by: NJames Morris <jmorris@namei.org>

9c0d9010

13 8月, 2009 2 次提交

perf: Rework/fix the whole read vs group stuff · 3dab77fb

由 Peter Zijlstra 提交于 8月 13, 2009

Replace PERF_SAMPLE_GROUP with PERF_SAMPLE_READ and introduce
PERF_FORMAT_GROUP to deal with group reads in a more generic
way.

This allows you to get group reads out of read() as well.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey J Ashford <cjashfor@us.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: stephane eranian <eranian@googlemail.com>
LKML-Reference: <20090813103655.117411814@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

3dab77fb

perf_counter: Provide hw_perf_counter_setup_online() APIs · 28402971

由 Ingo Molnar 提交于 8月 13, 2009

Provide weak aliases for hw_perf_counter_setup_online(). This is
used by the BTS patches (for v2.6.32), but it interacts with
fixes so propagate this upstream. (it has no effect as of yet)

Also export perf_counter_output() to architecture code.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

28402971

12 8月, 2009 1 次提交

NFS: Fix an O_DIRECT Oops... · 1ae88b2e

由 Trond Myklebust 提交于 8月 12, 2009

We can't call nfs_readdata_release()/nfs_writedata_release() without
first initialising and referencing args.context. Doing so inside
nfs_direct_read_schedule_segment()/nfs_direct_write_schedule_segment()
causes an Oops.

We should rather be calling nfs_readdata_free()/nfs_writedata_free() in
those cases.

Looking at the O_DIRECT code, the "struct nfs_direct_req" is already
referencing the nfs_open_context for us. Since the readdata and writedata
structures carry a reference to that, we can simplify things by getting rid
of the extra nfs_open_context references, so that we can replace all
instances of nfs_readdata_release()/nfs_writedata_release().
Reported-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: NCatalin Marinas <catalin.marinas@arm.com>
Cc: stable@kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1ae88b2e

10 8月, 2009 2 次提交

locking, sched: Give waitqueue spinlocks their own lockdep classes · 2fc39111

由 Peter Zijlstra 提交于 8月 10, 2009

Give waitqueue spinlocks their own lockdep classes when they
are initialised from init_waitqueue_head().  This means that
struct wait_queue::func functions can operate other waitqueues.

This is used by CacheFiles to catch the page from a backing fs
being unlocked and to wake up another thread to take a copy of
it.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Tested-by: NTakashi Iwai <tiwai@suse.de>
Cc: linux-cachefs@redhat.com
Cc: torvalds@osdl.org
Cc: akpm@linux-foundation.org
LKML-Reference: <20090810113305.17284.81508.stgit@warthog.procyon.org.uk>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

2fc39111

perf_counter: Correct PERF_SAMPLE_RAW output · a044560c

由 Peter Zijlstra 提交于 8月 10, 2009

PERF_SAMPLE_* output switches should unconditionally output the
correct format, as they are the only way to unambiguously parse
the PERF_EVENT_SAMPLE data.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1249896447.17467.74.camel@twins>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

a044560c

09 8月, 2009 2 次提交

perf_counter: Fix tracepoint sampling to be part of generic sampling · 3a43ce68

由 Frederic Weisbecker 提交于 8月 08, 2009

Based on Peter's comments, make tracepoint sampling generic
just like all the other sampling bits are. This is a rename
with no code changes:

- PERF_SAMPLE_TP_RECORD to PERF_SAMPLE_RAW
- struct perf_tracepoint_record to perf_raw_record

We want the system in place that transport tracepoints raw
samples events into the perf ring buffer to be generalized and
usable by any type of counter.

Reported-by; Peter Zijlstra <peterz@infradead.org>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1249698400-5441-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

3a43ce68

perf_counter: Fix/complete ftrace event records sampling · f413cdb8

由 Frederic Weisbecker 提交于 8月 07, 2009

This patch implements the kernel side support for ftrace event
record sampling.

A new counter sampling attribute is added:

   PERF_SAMPLE_TP_RECORD

which requests ftrace events record sampling. In this case
if a PERF_TYPE_TRACEPOINT counter is active and a tracepoint
fires, we emit the tracepoint binary record to the
perfcounter event buffer, as a sample.

Result, after setting PERF_SAMPLE_TP_RECORD attribute from perf
record:

 perf record -f -F 1 -a -e workqueue:workqueue_execution
 perf report -D

 0x21e18 [0x48]: event: 9
 .
 . ... raw event: size 72 bytes
 .  0000:  09 00 00 00 01 00 48 00 d0 c7 00 81 ff ff ff ff  ......H........
 .  0010:  0a 00 00 00 0a 00 00 00 21 00 00 00 00 00 00 00  ........!......
 .  0020:  2b 00 01 02 0a 00 00 00 0a 00 00 00 65 76 65 6e  +...........eve
 .  0030:  74 73 2f 31 00 00 00 00 00 00 00 00 0a 00 00 00  ts/1...........
 .  0040:  e0 b1 31 81 ff ff ff ff                          .......
.
0x21e18 [0x48]: PERF_EVENT_SAMPLE (IP, 1): 10: 0xffffffff8100c7d0 period: 33

The raw ftrace binary record starts at offset 0020.

Translation:

 struct trace_entry {
	type		= 0x2b = 43;
	flags		= 1;
	preempt_count	= 2;
	pid		= 0xa = 10;
	tgid		= 0xa = 10;
 }

 thread_comm = "events/1"
 thread_pid  = 0xa = 10;
 func	    = 0xffffffff8131b1e0 = flush_to_ldisc()

What will come next?

 - Userspace support ('perf trace'), 'flight data recorder' mode
   for perf trace, etc.

 - The unconditional copy from the profiling callback brings
   some costs however if someone wants no such sampling to
   occur, and needs to be fixed in the future. For that we need
   to have an instant access to the perf counter attribute.
   This is a matter of a flag to add in the struct ftrace_event.

 - Take care of the events recursivity! Don't ever try to record
   a lock event for example, it seems some locking is used in
   the profiling fast path and lead to a tracing recursivity.
   That will be fixed using raw spinlock or recursivity
   protection.

 - [...]

 - Profit! :-)
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

f413cdb8

08 8月, 2009 4 次提交

bzip2/lzma/gzip: fix comments describing decompressor API · daeb6b6f

由 Phillip Lougher 提交于 8月 06, 2009

Fix and improve comments in decompress/generic.h that describe the
decompressor API.  Also remove an unused definition, and rename INBUF_LEN
in lib/decompress_inflate.c to conform to bzip2/lzma naming.
Signed-off-by: NPhillip Lougher <phillip@lougher.demon.co.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

daeb6b6f

mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware · 4bfc4495

由 KAMEZAWA Hiroyuki 提交于 8月 06, 2009

At first, init_task's mems_allowed is initialized as this.
 init_task->mems_allowed == node_state[N_POSSIBLE]

And cpuset's top_cpuset mask is initialized as this
 top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]

Before 2.6.29:
policy's mems_allowed is initialized as this.

  1. update tasks->mems_allowed by its cpuset->mems_allowed.
  2. policy->mems_allowed = nodes_and(tasks->mems_allowed, user's mask)

Updating task's mems_allowed in reference to top_cpuset's one.
cpuset's mems_allowed is aware of N_HIGH_MEMORY, always.

In 2.6.30: After commit 58568d2a
("cpuset,mm: update tasks' mems_allowed in time"), policy's mems_allowed
is initialized as this.

  1. policy->mems_allowd = nodes_and(task->mems_allowed, user's mask)

Here, if task is in top_cpuset, task->mems_allowed is not updated from
init's one.  Assume user excutes command as #numactrl --interleave=all
,....

  policy->mems_allowd = nodes_and(N_POSSIBLE, ALL_SET_MASK)

Then, policy's mems_allowd can includes a possible node, which has no pgdat.

MPOL's INTERLEAVE just scans nodemask of task->mems_allowd and access this
directly.

  NODE_DATA(nid)->zonelist even if NODE_DATA(nid)==NULL

Then, what's we need is making policy->mems_allowed be aware of
N_HIGH_MEMORY.  This patch does that.  But to do so, extra nodemask will
be on statck.  Because I know cpumask has a new interface of
CPUMASK_ALLOC(), I added it to node.

This patch stands on old behavior.  But I feel this fix itself is just a
Band-Aid.  But to do fundametal fix, we have to take care of memory
hotplug and it takes time.  (task->mems_allowd should be N_HIGH_MEMORY, I
think.)

mpol_set_nodemask() should be aware of N_HIGH_MEMORY and policy's nodemask
should be includes only online nodes.

In old behavior, this is guaranteed by frequent reference to cpuset's
code.  Now, most of them are removed and mempolicy has to check it by
itself.

To do check, a few nodemask_t will be used for calculating nodemask.  But,
size of nodemask_t can be big and it's not good to allocate them on stack.

Now, cpumask_t has CPUMASK_ALLOC/FREE an easy code for get scratch area.
NODEMASK_ALLOC/FREE shoudl be there.

[akpm@linux-foundation.org: cleanups & tweaks]
Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4bfc4495

vfs: add __destroy_inode · 2e00c97e

由 Christoph Hellwig 提交于 8月 07, 2009

When we want to tear down an inode that lost the add to the cache race
in XFS we must not call into ->destroy_inode because that would delete
the inode that won the race from the inode cache radix tree.

This patch provides the __destroy_inode helper needed to fix this,
the actual fix will be in th next patch. As XFS was the only reason
destroy_inode was exported we shift the export to the new __destroy_inode.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NEric Sandeen <sandeen@sandeen.net>

2e00c97e

vfs: fix inode_init_always calling convention · 54e34621

由 Christoph Hellwig 提交于 8月 07, 2009

Currently inode_init_always calls into ->destroy_inode if the additional
initialization fails. That's not only counter-intuitive because
inode_init_always did not allocate the inode structure, but in case of
XFS it's actively harmful as ->destroy_inode might delete the inode from
a radix-tree that has never been added. This in turn might end up
deleting the inode for the same inum that has been instanciated by
another process and cause lots of cause subtile problems.

Also in the case of re-initializing a reclaimable inode in XFS it would
free an inode we still want to keep alive.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NEric Sandeen <sandeen@sandeen.net>

54e34621

06 8月, 2009 2 次提交

Input: matrix_keypad - make matrix keymap size dynamic · d82f1c35

由 Eric Miao 提交于 8月 05, 2009

Remove assumption on the shift and size of rows/columns form
matrix_keypad driver.
Signed-off-by: NEric Miao <eric.y.miao@gmail.com>
Signed-off-by: NDmitry Torokhov <dtor@mail.ru>

d82f1c35

ftrace: Fix perf-tracepoint OOPS · af6af30c

由 Peter Zijlstra 提交于 8月 05, 2009

Not all tracepoints are created equal, in specific the ftrace
tracepoints are created with TRACE_EVENT_FORMAT() which does
not generate the needed bits to tie them into perf counters.

For those events, don't create the 'id' file and fail
->profile_enable when their ID is specified through other
means.
Reported-by: NChris Mason <chris.mason@oracle.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1249497664.5890.4.camel@laptop>
[ v2: fix build error in the !CONFIG_EVENT_PROFILE case ]
Signed-off-by: NIngo Molnar <mingo@elte.hu>

af6af30c

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功