提交 · 5478755616ae2ef1ce144dded589b62b2a50d575 · openeuler / raspberrypi-kernel

29 11月, 2010 1 次提交

block: check for proper length of iov entries earlier in blk_rq_map_user_iov() · 54787556

由 Xiaotian Feng 提交于 11月 29, 2010

commit 9284bcf4 checks for proper length of iov entries in
blk_rq_map_user_iov(). But if the map is unaligned, kernel
will break out the loop without checking for the proper length.
So we need to check the proper length before the unalign check.
Signed-off-by: NXiaotian Feng <dfeng@redhat.com>
Cc: stable@kernel.org
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

54787556

18 11月, 2010 1 次提交

BKL: remove extraneous #include <smp_lock.h> · 451a3c24

由 Arnd Bergmann 提交于 11月 17, 2010

The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

451a3c24

16 11月, 2010 1 次提交

blk-throttle: Fix calculation of max number of WRITES to be dispatched · c2f6805d

由 Vivek Goyal 提交于 11月 15, 2010

o Currently we try to dispatch more READS and less WRITES (75%, 25%) in one
  dispatch round. ummy pointed out that there is a bug in max_nr_writes
  calculation. This patch fixes it.
Reported-by: Nummy y <yummylln@yahoo.com.cn>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

c2f6805d

11 11月, 2010 1 次提交

block: remove unused copy_io_context() · cedb4a7d

由 Jens Axboe 提交于 11月 11, 2010

Reported-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

cedb4a7d

10 11月, 2010 5 次提交

block: remove REQ_HARDBARRIER · 02e031cb

由 Christoph Hellwig 提交于 11月 10, 2010

REQ_HARDBARRIER is dead now, so remove the leftovers.  What's left
at this point is:

 - various checks inside the block layer.
 - sanity checks in bio based drivers.
 - now unused bio_empty_barrier helper.
 - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
   but Xen really needs to sort out it's barrier situaton.
 - setting of ordered tags in uas - dead code copied from old scsi
   drivers.
 - scsi different retry for barriers - it's dead and should have been
   removed when flushes were converted to FS requests.
 - blktrace handling of barriers - removed.  Someone who knows blktrace
   better should add support for REQ_FLUSH and REQ_FUA, though.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

02e031cb

block: ioctl: fix information leak to userland · a014741c

由 Vasiliy Kulikov 提交于 11月 08, 2010

Structure hd_geometry is copied to userland with 4 padding bytes
between cylinders and start fields uninitialized on 64-bit platforms.
It leads to leaking of contents of kernel stack memory.

Currently there is no memset() in real implementations of getgeo()
in drivers/block/, so it makes sense to have memset() in blkdev_ioctl().
Signed-off-by: NVasiliy Kulikov <segooon@gmail.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

a014741c

block: read i_size with i_size_read() · 77304d2a

由 Mike Snitzer 提交于 11月 08, 2010

Convert direct reads of an inode's i_size to using i_size_read().

i_size_{read,write} use a seqcount to protect reads from accessing
incomple writes.  Concurrent i_size_write()s require mutual exclussion
to protect the seqcount that is used by i_size_{read,write}.  But
i_size_read() callers do not need to use additional locking.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NNeilBrown <neilb@suse.de>
Acked-by: NLars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

77304d2a

block: take care not to overflow when calculating total iov length · 9f864c80

由 Jens Axboe 提交于 10月 29, 2010

Reported-by: NDan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

9f864c80

block: check for proper length of iov entries in blk_rq_map_user_iov() · 9284bcf4

由 Jens Axboe 提交于 10月 29, 2010

Ensure that we pass down properly validated iov segments before
calling into the mapping or copy functions.
Reported-by: NDan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

9284bcf4

25 10月, 2010 1 次提交

Revert "block: fix accounting bug on cross partition merges" · f253b86b

由 Jens Axboe 提交于 10月 24, 2010

This reverts commit 7681bfee.

Conflicts:

	include/linux/genhd.h

It has numerous issues with the cleanup path and non-elevator
devices. Revert it for now so we can come up with a clean
version without rushing things.
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

f253b86b

24 10月, 2010 1 次提交

block: fix use-after-free bug in blk throttle code · 7ad58c02

由 Jens Axboe 提交于 10月 23, 2010

blk_throtl_exit() frees the throttle data hanging off the queue
in blk_cleanup_queue(), but blk_put_queue() will indirectly
dereference this data when calling blk_sync_queue() which in
turns calls throtl_shutdown_timer_wq().

Fix this by moving the freeing of the throttle data to when
the queue is truly being released, and post the call to
blk_sync_queue().
Reported-by: NIngo Molnar <mingo@elte.hu>
Tested-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

7ad58c02

23 10月, 2010 1 次提交

SYSFS: Allow boot time switching between deprecated and modern sysfs layout · e52eec13

由 Andi Kleen 提交于 9月 08, 2010

I have some systems which need legacy sysfs due to old tools that are
making assumptions that a directory can never be a symlink to another
directory, and it's a big hazzle to compile separate kernels for them.

This patch turns CONFIG_SYSFS_DEPRECATED into a run time option
that can be switched on/off the kernel command line. This way
the same binary can be used in both cases with just a option
on the command line.

The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
the default. I kept the weird name to not break existing
config files.

Also the compat code can be still completely disabled by undefining
CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes
care of this now instead of lots of ifdefs. This makes the code
look nicer.

v2: This is an updated version on top of Kay's patch to only
handle the block devices. I tested it on my old systems
and that seems to work.

Cc: axboe@kernel.dk
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

e52eec13

22 10月, 2010 1 次提交

cfq-iosched: Fix a gcc 4.5 warning and put some comments · b4627321

由 Vivek Goyal 提交于 10月 22, 2010

- Andi encountedred following warning with gcc 4.5

  linux/block/cfq-iosched.c: In function ‘cfq_dispatch_requests’:
  linux/block/cfq-iosched.c:2156:3: warning: array subscript is above array
  bounds

- Warning happens due to following code.

  slice = group_slice * count /
		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
		cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));

  gcc is complaining about cfqg->busy_queues_avg[] being indexed by CFQ
  prio classes (RT, BE and IDLE) while the array size is only 2.

- At run time, we never access cfqg->busy_queues_avg[IDLE] and return from
  function before this code hits.

- To fix warning increase the array size though it will remain unused. This
  patch also puts some comments to clarify some of the confusions.

- I have taken Jens's patch and modified it a bit.

- Compile tested with gcc 4.4 and boot tested. I don't have gcc 4.5
  running, Andi can you please test it with gcc 4.5 to make sure it
  worked.
Reported-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Acked-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

b4627321

19 10月, 2010 1 次提交

block: fix accounting bug on cross partition merges · 7681bfee

由 Yasuaki Ishimatsu 提交于 10月 19, 2010

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

When reloading partition tables, quiesce IO to ensure that no
request references to the partition struct exists. When it is safe
to free the partition table, the IO for that device is restarted
again.
Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

7681bfee

15 10月, 2010 3 次提交

[SCSI] bsg: fix incorrect device_status value · 47897160

由 FUJITA Tomonori 提交于 9月 17, 2010

bsg incorrectly returns sg's masked_status value for device_status.

[jejb: fix up expression logic]
Reported-by: NDouglas Gilbert <dgilbert@interlog.com>
Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Stable Tree <stable@kernel.org>
Signed-off-by: NJames Bottomley <James.Bottomley@suse.de>

47897160

llseek: automatically add .llseek fop · 6038f373

由 Arnd Bergmann 提交于 8月 15, 2010

All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.

The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.

New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time.  Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.

The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.

Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.

Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.

===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
//   but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}

@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}

@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
   *off = E
|
   *off += E
|
   func(..., off, ...)
|
   E = *off
)
...+>
}

@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}

@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
  *off = E
|
  *off += E
|
  func(..., off, ...)
|
  E = *off
)
...+>
}

@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}

@ fops0 @
identifier fops;
@@
struct file_operations fops = {
 ...
};

@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
 .llseek = llseek_f,
...
};

@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
 .read = read_f,
...
};

@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
 .write = write_f,
...
};

@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
 .open = open_f,
...
};

// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
...  .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};

@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
...  .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};

// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
...  .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};

// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};

// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};

@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+	.llseek = default_llseek, /* write accesses f_pos */
};

// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////

@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
 .write = write_f,
 .read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};

@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};

@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};

@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>

6038f373

block: Fix double free in blk_integrity_unregister · e817bf3f

由 Martin K. Petersen 提交于 10月 15, 2010

Commit 3839e4b2 introduced a kobject_put but failed to remove the
kmem_cache_free beneath it, leading to a double free.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

e817bf3f

14 10月, 2010 1 次提交

block: Ensure physical block size is unsigned int · 892b6f90

由 Martin K. Petersen 提交于 10月 13, 2010

Physical block size was declared unsigned int to accomodate the maximum
size reported by READ CAPACITY(16).  Make sure we use the right type in
the related functions.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

892b6f90

07 10月, 2010 1 次提交

elevator: fix oops on early call to elevator_change() · 430c62fb

由 Jens Axboe 提交于 10月 07, 2010

2.6.36 introduces an API for drivers to switch the IO scheduler
instead of manually calling the elevator exit and init functions.
This API was added since q->elevator must be cleared in between
those two calls. And since we already have this functionality
directly from use by the sysfs interface to switch schedulers
online, it was prudent to reuse it internally too.

But this API needs the queue to be in a fully initialized state
before it is called, or it will attempt to unregister elevator
kobjects before they have been added. This results in an oops
like this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
IP: [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
PGD 47ddfc067 PUD 47c6a1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
CPU 2
Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb

Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
RIP: 0010:[<ffffffff8116f15e>]  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
RSP: 0018:ffff88027da25d08  EFLAGS: 00010246
RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
FS:  00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
Stack:
 ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
<0> ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
<0> ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
Call Trace:
 [<ffffffff8123fb77>] kobject_add_internal+0xe7/0x1f0
 [<ffffffff8123fd98>] kobject_add_varg+0x38/0x60
 [<ffffffff8123feb9>] kobject_add+0x69/0x90
 [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
 [<ffffffff8103d48d>] ? sub_preempt_count+0x9d/0xe0
 [<ffffffff8143de20>] ? _raw_spin_unlock+0x30/0x50
 [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
 [<ffffffff8116eff4>] ? sysfs_remove_dir+0x34/0xa0
 [<ffffffff81224204>] elv_register_queue+0x34/0xa0
 [<ffffffff81224aad>] elevator_change+0xfd/0x250
 [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
 [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
 [<ffffffffa007e0a8>] t_init+0xa8/0x361 [t]
 [<ffffffff810001de>] do_one_initcall+0x3e/0x170
 [<ffffffff8108c3fd>] sys_init_module+0xbd/0x220
 [<ffffffff81002f2b>] system_call_fastpath+0x16/0x1b
Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 <41> 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
RIP  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
 RSP <ffff88027da25d08>
CR2: 0000000000000051
---[ end trace a6541d3bf07945df ]---

Fix this by adding a registered bit to the elevator queue, which is
set when the sysfs kobjects have been registered.
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

430c62fb

05 10月, 2010 1 次提交

block: autoconvert trivial BKL users to private mutex · 2a48fc0a

由 Arnd Bergmann 提交于 6月 02, 2010

The block device drivers have all gained new lock_kernel
calls from a recent pushdown, and some of the drivers
were already using the BKL before.

This turns the BKL into a set of per-driver mutexes.
Still need to check whether this is safe to do.

file=$1
name=$2
if grep -q lock_kernel ${file} ; then
    if grep -q 'include.*linux.mutex.h' ${file} ; then
            sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
    else
            sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
    fi
    sed -i ${file} \
        -e "/^#include.*linux.mutex.h/,$ {
                1,/^\(static\|int\|long\)/ {
                     /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);

} }"  \
    -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
    -e '/[      ]*cycle_kernel_lock();/d'
else
    sed -i -e '/include.*\<smp_lock.h\>/d' ${file}  \
                -e '/cycle_kernel_lock()/d'
fi
Signed-off-by: NArnd Bergmann <arnd@arndb.de>

2a48fc0a

02 10月, 2010 3 次提交

blkio-throttle: Fix possible multiplication overflow in iops calculations · c49c06e4

由 Vivek Goyal 提交于 10月 01, 2010

o User can specify max iops value of 32bit (UINT_MAX), through cgroup
  interface. If a user has specified say 4294967294 (UNIT_MAX  - 2), then
  on 32bit platform, following multiplication can overflow.

  io_allowed = (tg->iops[rw] * jiffy_elapsed_rnd)

o Explicitly cast the multiplication to 64bit and then perform division and
  then check whether result is still great then UNINT_MAX.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

c49c06e4

blkio-throttle: limit max iops value to UINT_MAX · 9355aede

由 Vivek Goyal 提交于 10月 01, 2010

- Limit max iops value to UINT_MAX and return error to user if value is more
  than that instead of accepting bigger values and truncating implicitly.
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

9355aede

blkio-throttle: There is no need to convert jiffies to milli seconds · 5e901a2b

由 Vivek Goyal 提交于 10月 01, 2010

o Do not convert jiffies to mili seconds as it is not required. Just work
  with jiffies and HZ.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

5e901a2b

01 10月, 2010 7 次提交

blkio-throttle: Fix link failure failure on i386 · 3aad5d3e

由 Vivek Goyal 提交于 10月 01, 2010

o Randy Dunlap reported following linux-next failure. This patch fixes it.

on i386:

blk-throttle.c:(.text+0x1abb8): undefined reference to `__udivdi3'
blk-throttle.c:(.text+0x1b1dc): undefined reference to `__udivdi3'

o bytes_per_second interface is 64bit and I was continuing to do 64 bit
  division even on 32bit platform without help of special macros/functions
  hence the failure.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Reported-by: NRandy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

3aad5d3e

blkio: Recalculate the throttled bio dispatch time upon throttle limit change · fe071437

由 Vivek Goyal 提交于 10月 01, 2010

o Currently any cgroup throttle limit changes are processed asynchronousy and
the change does not take affect till a new bio is dispatched from same group.

o It might happen that a user sets a redicuously low limit on throttling.
Say 1 bytes per second on reads. In such cases simple operations like mount
a disk can wait for a very long time.

o Once bio is throttled, there is no easy way to come out of that wait even if
user increases the read limit later.

o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
the bio dispatch time according to new limits.

o Can't take queueu lock under blkcg_lock, hence after the change I wake
up the dispatch thread again which recalculates the time. So there are some
variables being synchronized across two threads without lock and I had to
make use of barriers. Hoping I have used barriers correctly. Any review of
memory barrier code especially will help.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

fe071437

blkio: Add root group to td->tg_list · 02977e4a

由 Vivek Goyal 提交于 10月 01, 2010

o Currently all the dynamically allocated groups, except root grp is added
  to td->tg_list. This was not a problem so far but in next patch I will
  travel through td->tg_list to process any updates of limits on the group.
  If root group is not in tg_list, then root group's updates are not
  processed.

o It is better to root group also to tg_list instead of doing special
  processing for it during limit updates.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

02977e4a

blkio: deletion of a cgroup was causes oops · 61014e96

由 Vivek Goyal 提交于 10月 01, 2010

o Now a cgroup list of blkg elements can contain blkg from multiple policies.
Before sending an unlink event, make sure blkg belongs to they policy. If
policy does not own the blkg, do not send update for this blkg.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

61014e96

blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n · 13f98250

由 Vivek Goyal 提交于 10月 01, 2010

Currently throttling related files were visible even if user had disabled
throttling using config options. It was switching off background throttling
of bio but not the cgroup files. This patch fixes it.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

13f98250

block: set the bounce_pfn to the actual DMA limit rather than to max memory · efb012b3

由 Malahal Naineni 提交于 10月 01, 2010

The bounce_pfn of the request queue in 64 bit systems is set to the
current max_low_pfn. Adding more memory later makes this incorrect.
Memory allocated beyond this boot time max_low_pfn appear to require
bounce buffers (bounce buffers are actually not allocated but used in
calculating segments that may result in "over max segments limit"
errors).
Signed-off-by: NMalahal Naineni <malahal@us.ibm.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

efb012b3

block: revert bad fix for memory hotplug causing bounces · 260a67a9

由 Jens Axboe 提交于 10月 01, 2010

Revert "block: set the bounce_pfn to the actual DMA limit rather than to max memory"

This reverts commit c49825fa.
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

260a67a9

25 9月, 2010 2 次提交

block: prevent merges of discard and write requests · f281fb5f

由 Adrian Hunter 提交于 9月 25, 2010

Add logic to prevent two I/O requests being merged if
only one of them is a discard.  Ditto secure discard.

Without this fix, it is possible for write requests
to transform into discard requests.  For example:

  Submit bio 1 to discard 8 sectors from sector n
  Submit bio 2 to write 8 sectors from sector n + 16
  Submit bio 3 to write 8 sectors from sector n + 8

Bio 1 becomes request 1.  Bio 2 becomes request 2.
Bio 3 is merged with request 2, and then subsequently
request 2 is merged with request 1 resulting in just
one I/O request which discards all 24 sectors.
Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>

(Moved the checks above the position checks /Jens)
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

f281fb5f

block: set the bounce_pfn to the actual DMA limit rather than to max memory · c49825fa

由 Malahal Naineni 提交于 9月 24, 2010

The bounce_pfn of the request queue in 64 bit systems is set to the
current max_low_pfn. Adding more memory later makes this incorrect.
Memory allocated beyond this boot time max_low_pfn appear to require
bounce buffers (bounce buffers are actually not allocated but used in
calculating segments that may result in "over max segments limit"
errors).
Signed-off-by: NMalahal Naineni <malahal@us.ibm.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

c49825fa

24 9月, 2010 1 次提交

block: Prevent hang_check firing during long I/O · 4b197769

由 Mark Lord 提交于 9月 24, 2010

During long I/O operations, the hang_check timer may fire,
trigger stack dumps that unnecessarily alarm the user.

Eg.  hdparm --security-erase NULL /dev/sdb  ## can take *hours* to complete

So, if hang_check is armed, we should wake up periodically
to prevent it from triggering.  This patch uses a wake-up interval
equal to half the hang_check timer period, which keeps overhead low enough.
Signed-off-by: NMark Lord <mlord@pobox.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

4b197769

21 9月, 2010 2 次提交

cfq-iosched: fix a kernel OOPs when usb key is inserted · 180be2a0

由 Vivek Goyal 提交于 9月 14, 2010

Mike reported a kernel crash when a usb key hotplug is performed while all
kernel thrads are not in a root cgroup and are running in one of the child
cgroups of blkio controller.

	BUG: unable to handle kernel NULL pointer dereference at 0000002c
	IP: [<c11c7b08>] cfq_get_queue+0x232/0x412
	*pde = 00000000
	Oops: 0000 [#1] PREEMPT
	last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb2/2-1/2-1:1.0/host3/scsi_host/host3/uevent

	[..]
	Pid: 30039, comm: scsi_scan_3 Not tainted 2.6.35.2-fg.roam #1 Volvi2                         /Aspire 4315
	EIP: 0060:[<c11c7b08>] EFLAGS: 00010086 CPU: 0
	EIP is at cfq_get_queue+0x232/0x412
	EAX: f705f9c0 EBX: e977abac ECX: 00000000 EDX: 00000000
	ESI: f00da400 EDI: f00da4ec EBP: e977a800 ESP: dff8fd00
	 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
	Process scsi_scan_3 (pid: 30039, ti=dff8e000 task=f6b6c9a0 task.ti=dff8e000)
	Stack:
	 00000000 00000000 00000001 01ff0000 f00da508 00000000 f00da524 f00da540
	<0> e7994940 dd631750 f705f9c0 e977a820 e977ac44 f00da4d0 00000001 f6b6c9a0
	<0> 00000010 00008010 0000000b 00000000 00000001 e977a800 dd76fac0 00000246
	Call Trace:
	 [<c11c7f10>] ? cfq_set_request+0x228/0x34c
	 [<c11c7ce8>] ? cfq_set_request+0x0/0x34c
	 [<c11bb3b9>] ? elv_set_request+0xf/0x1c
	 [<c11bdd51>] ? get_request+0x1ad/0x22f
	 [<c11bddf2>] ? get_request_wait+0x1f/0x11a
	 [<c11d013b>] ? kvasprintf+0x33/0x3b
	 [<c127b537>] ? scsi_execute+0x1d/0x103
	 [<c127b675>] ? scsi_execute_req+0x58/0x83
	 [<c127c391>] ? scsi_probe_and_add_lun+0x188/0x7c2
	 [<c12718c6>] ? attribute_container_add_device+0x15/0xfa
	 [<c11c95d1>] ? kobject_get+0xf/0x13
	 [<c126d1db>] ? get_device+0x10/0x14
	 [<c127be93>] ? scsi_alloc_target+0x217/0x24d
	 [<c127cbd8>] ? __scsi_scan_target+0x95/0x480
	 [<c10204eb>] ? dequeue_entity+0x14/0x1fe
	 [<c1020491>] ? update_curr+0x165/0x1ab
	 [<c1020491>] ? update_curr+0x165/0x1ab
	 [<c127d00d>] ? scsi_scan_channel+0x4a/0x76
	 [<c127d0b0>] ? scsi_scan_host_selected+0x77/0xad
	 [<c127d13c>] ? do_scan_async+0x0/0x11a
	 [<c127d137>] ? do_scsi_scan_host+0x51/0x56
	 [<c127d13c>] ? do_scan_async+0x0/0x11a
	 [<c127d14a>] ? do_scan_async+0xe/0x11a
	 [<c127d13c>] ? do_scan_async+0x0/0x11a
	 [<c10354c5>] ? kthread+0x5e/0x63
	 [<c1035467>] ? kthread+0x0/0x63
	 [<c1002af6>] ? kernel_thread_helper+0x6/0x10
	Code: 44 24 1c 54 83 44 24 18 54 83 fa 03 75 94 8b 06 c7 86 64 02 00 00 01 00 00 00 83 e0 03 09 f0 89 06 8b 44 24 28 8b 90 58 01 00 00 <8b> 42 2c 85 c0 75 03 8b 42 08 8d 54 24 48 52 8d 4c 24 50 51 68
	EIP: [<c11c7b08>] cfq_get_queue+0x232/0x412 SS:ESP 0068:dff8fd00
	CR2: 000000000000002c
	---[ end trace 9a88306573f69b12 ]---

The problem here is that we don't have bdi->dev information available when
thread does some IO.  Hence when dev_name() tries to access bdi->dev, it
crashes.

This problem does not happen if kernel threads are in root group as root
group is statically allocated at device initialization time and we don't
hit this piece of code.

Fix it by delaying the filling of major and minor number information of
device in blk_group.  Initially a blk_group is created with 0 as device
information and this information is filled later once some more IO comes
in from same group.
Reported-by: NMike Kazantsev <mk.fraggod@gmail.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

180be2a0

block: fix blk_rq_map_kern bio direction flag · a45dc2d2

由 Benny Halevy 提交于 9月 13, 2010

This bug was introduced in 7b6d91da
"block: unify flags for struct bio and struct request"

Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NBenny Halevy <bhalevy@panasas.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

a45dc2d2

20 9月, 2010 1 次提交

cfq: improve fsync performance for small files · 749ef9f8

由 Corrado Zoccolo 提交于 9月 20, 2010

Fsync performance for small files achieved by cfq on high-end disks is
lower than what deadline can achieve, due to idling introduced between
the sync write happening in process context and the journal commit.

Moreover, when competing with a sequential reader, a process writing
small files and fsync-ing them is starved.

This patch fixes the two problems by:
- marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE
  flag set,
- force all queues that have REQ_NOIDLE requests to be put in the noidle
  tree.

Having the queue associated to the fsync-ing process and the one associated
 to journal commits in the noidle tree allows:
- switching between them without idling,
- fairness vs. competing idling queues, since they will be serviced only
  after the noidle tree expires its slice.
Acked-by: NVivek Goyal <vgoyal@redhat.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Tested-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NCorrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

749ef9f8

17 9月, 2010 2 次提交

block: remove BLKDEV_IFL_WAIT · dd3932ed

由 Christoph Hellwig 提交于 9月 16, 2010

All the blkdev_issue_* helpers can only sanely be used for synchronous
caller. To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout. For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

dd3932ed

block: Fix race during disk initialization · 01ea5063

由 Signed-off-by: Jan Kara 提交于 9月 16, 2010

When a new disk is being discovered, add_disk() first ties the bdev to gendisk
(via register_disk()->blkdev_get()) and only after that calls
bdi_register_bdev(). Because register_disk() also creates disk's kobject, it
can happen that userspace manages to open and modify the device's data (or
inode) before its BDI is properly initialized leading to a warning in
__mark_inode_dirty().

Fix the problem by registering BDI early enough.

This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=16312

Cc: stable@kernel.org
Reported-by: NLarry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

01ea5063

16 9月, 2010 2 次提交

blkio: Implementation of IOPS limit logic · 8e89d13f

由 Vivek Goyal 提交于 9月 15, 2010

o core logic of implementing IOPS throttling.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

8e89d13f

blk-cgroup: cgroup changes for IOPS limit support · 7702e8f4

由 Vivek Goyal 提交于 9月 15, 2010

o cgroup changes for IOPS throttling rules.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

7702e8f4