提交 · 4345caba340f051e10847924fc078ae18ed6695c · FengXiao2002 / Linux Imx

07 1月, 2011 1 次提交

block: add internal hd part table references · 6c23a968

由 Jens Axboe 提交于 1月 07, 2011

We can't use krefs since it's apparently restricted to very basic
reference counting.

This reverts commit e4a683c8.
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

6c23a968

05 1月, 2011 1 次提交

block: fix accounting bug on cross partition merges · 09e099d4

由 Jerome Marchand 提交于 1月 05, 2011

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.
Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

09e099d4

17 12月, 2010 1 次提交

block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead · e692cb66

由 Martin K. Petersen 提交于 12月 01, 2010

When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.

There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.

The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.
Reported-by: NEd Lin <ed.lin@promise.com>
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

e692cb66

25 10月, 2010 1 次提交

Revert "block: fix accounting bug on cross partition merges" · f253b86b

由 Jens Axboe 提交于 10月 24, 2010

This reverts commit 7681bfee.

Conflicts:

	include/linux/genhd.h

It has numerous issues with the cleanup path and non-elevator
devices. Revert it for now so we can come up with a clean
version without rushing things.
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

f253b86b

19 10月, 2010 1 次提交

block: fix accounting bug on cross partition merges · 7681bfee

由 Yasuaki Ishimatsu 提交于 10月 19, 2010

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

When reloading partition tables, quiesce IO to ensure that no
request references to the partition struct exists. When it is safe
to free the partition table, the IO for that device is restarted
again.
Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

7681bfee

25 9月, 2010 1 次提交

block: prevent merges of discard and write requests · f281fb5f

由 Adrian Hunter 提交于 9月 25, 2010

Add logic to prevent two I/O requests being merged if
only one of them is a discard.  Ditto secure discard.

Without this fix, it is possible for write requests
to transform into discard requests.  For example:

  Submit bio 1 to discard 8 sectors from sector n
  Submit bio 2 to write 8 sectors from sector n + 16
  Submit bio 3 to write 8 sectors from sector n + 8

Bio 1 becomes request 1.  Bio 2 becomes request 2.
Bio 3 is merged with request 2, and then subsequently
request 2 is merged with request 1 resulting in just
one I/O request which discards all 24 sectors.
Signed-off-by: NAdrian Hunter <adrian.hunter@nokia.com>

(Moved the checks above the position checks /Jens)
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

f281fb5f

11 9月, 2010 1 次提交

block/scsi: Provide a limit on the number of integrity segments · 13f05c8d

由 Martin K. Petersen 提交于 9月 10, 2010

Some controllers have a hardware limit on the number of protection
information scatter-gather list segments they can handle.

Introduce a max_integrity_segments limit in the block layer and provide
a new scsi_host_template setting that allows HBA drivers to provide a
value suitable for the hardware.

Add support for honoring the integrity segment limit when merging both
bios and requests.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <axboe@carl.home.kernel.dk>

13f05c8d

08 8月, 2010 3 次提交

gcc-4.6: block: fix unused but set variables in blk-merge · 2c8919de

由 Andi Kleen 提交于 6月 21, 2010

Just some dead code.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

2c8919de

block: unify flags for struct bio and struct request · 7b6d91da

由 Christoph Hellwig 提交于 8月 07, 2010

Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

7b6d91da

block: remove wrappers for request type/flags · 33659ebb

由 Christoph Hellwig 提交于 8月 07, 2010

Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct requests.  This allows much easier grepping for different request
types instead of unwinding through macros.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

33659ebb

26 2月, 2010 1 次提交

block: Consolidate phys_segment and hw_segment limits · 8a78362c

由 Martin K. Petersen 提交于 2月 26, 2010

Except for SCSI no device drivers distinguish between physical and
hardware segment limits.  Consolidate the two into a single segment
limit.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

8a78362c

07 10月, 2009 1 次提交

block: Seperate read and write statistics of in_flight requests v2 · 316d315b

由 Nikanth Karthikesan 提交于 10月 06, 2009

Commit a9327cac added seperate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress seperately through sysfs.

But  Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.

So this was reverted by commit 0f78ab98

The problem was in part_round_stats_single(), I missed the following:
        if (now == part->stamp)
                return;

-       if (part->in_flight) {
+       if (part_in_flight(part)) {
                __part_stat_add(cpu, part, time_in_queue,
                                part_in_flight(part) * (now - part->stamp));
                __part_stat_add(cpu, part, io_ticks, (now - part->stamp));

With this chunk included, the reported regression gets fixed.
Signed-off-by: NNikanth Karthikesan <knikanth@suse.de>

--
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

316d315b

05 10月, 2009 1 次提交

Revert "Seperate read and write statistics of in_flight requests" · 0f78ab98

由 Jens Axboe 提交于 10月 04, 2009

This reverts commit a9327cac.

Corrado Zoccolo <czoccolo@gmail.com> reports:

"with 2.6.32-rc1 I started getting the following strange output from
"iostat -kx 2":
Linux 2.6.31bisect (et2) 	04/10/2009 	_i686_	(2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10,70    0,00    3,16   15,75    0,00   70,38

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda              18,22     0,00    0,67    0,01    14,77     0,02
43,94     0,01   10,53 39043915,03 2629219,87
sdb              60,89     9,68   50,79    3,04  1724,43    50,52
65,95     0,70   13,06 488437,47 2629219,87

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,72    0,00    0,74    0,00    0,00   96,53

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6,68    0,00    0,99    0,00    0,00   92,33

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4,40    0,00    0,73    1,47    0,00   93,40

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00
0,00     0,00    0,00   0,00 100,00
sdb               0,00     4,00    0,00    3,00     0,00    28,00
18,67     0,06   19,50 333,33 100,00

Global values for service time and utilization are garbage. For
interval values, utilization is always 100%, and service time is
higher than normal.

I bisected it down to:
[a9327cac] Seperate read and write
statistics of in_flight requests
and verified that reverting just that commit indeed solves the issue
on 2.6.32-rc1."

So until this is debugged, revert the bad commit.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

0f78ab98

14 9月, 2009 1 次提交

Seperate read and write statistics of in_flight requests · a9327cac

由 Nikanth Karthikesan 提交于 9月 11, 2009

Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.

This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.
Signed-off-by: NNikanth Karthikesan <knikanth@suse.de>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

a9327cac

11 9月, 2009 2 次提交

scsi,block: update SCSI to handle mixed merge failures · da6c5c72

由 Tejun Heo 提交于 9月 11, 2009

Update scsi_io_completion() such that it only fails requests till the
next error boundary and retry the leftover.  This enables block layer
to merge requests with different failfast settings and still behave
correctly on errors.  Allow merge of requests of different failfast
settings.

As SCSI is currently the only subsystem which follows failfast status,
there's no need to worry about other block drivers for now.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

da6c5c72

block: implement mixed merge of different failfast requests · 80a761fd

由 Tejun Heo 提交于 7月 03, 2009

Failfast has characteristics from other attributes.  When issuing,
executing and successuflly completing requests, failfast doesn't make
any difference.  It only affects how a request is handled on failure.
Allowing requests with different failfast settings to be merged cause
normal IOs to fail prematurely while not allowing has performance
penalties as failfast is used for read aheads which are likely to be
located near in-flight or to-be-issued normal IOs.

This patch introduces the concept of 'mixed merge'.  A request is a
mixed merge if it is merge of segments which require different
handling on failure.  Currently the only mixable attributes are
failfast ones (or lack thereof).

When a bio with different failfast settings is added to an existing
request or requests of different failfast settings are merged, the
merged request is marked mixed.  Each bio carries failfast settings
and the request always tracks failfast state of the first bio.  When
the request fails, blk_rq_err_bytes() can be used to determine how
many bytes can be safely failed without crossing into an area which
requires further retrials.

This allows request merging regardless of failfast settings while
keeping the failure handling correct.

This patch only implements mixed merge but doesn't enable it.  The
next one will update SCSI to make use of mixed merge.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

80a761fd

04 7月, 2009 1 次提交

block: don't merge requests of different failfast settings · ab0fd1de

由 Tejun Heo 提交于 7月 03, 2009

Block layer used to merge requests and bios with different failfast
settings.  This caused regular IOs to fail prematurely when they were
merged into failfast requests for readahead.

Niel Lambrechts could trigger the problem semi-reliably on ext4 when
resuming from STR.  ext4 uses readahead when reading inodes and
combined with the deterministic extra SATA PHY exception cycle during
resume on the specific configuration, non-readahead inode read would
fail causing ext4 errors.  Please read the following thread for
details.

  http://lkml.org/lkml/2009/5/23/21

This patch makes block layer reject merging if the failfast settings
don't match.  This is correct but likely to lower IO performance by
preventing regular IOs from mingling into surrounding readahead
requests.  Changes to allow such mixed merges and handle errors
correctly will be added later.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NNiel Lambrechts <niel.lambrechts@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Signed-off-by: NJens Axboe <axboe@carl.(none)>

ab0fd1de

23 5月, 2009 1 次提交

block: Use accessor functions for queue limits · ae03bf63

由 Martin K. Petersen 提交于 5月 22, 2009

Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

ae03bf63

11 5月, 2009 3 次提交

block: hide request sector and data_len · a2dec7b3

由 Tejun Heo 提交于 5月 07, 2009

Block low level drivers for some reason have been pretty good at
abusing block layer API.  Especially struct request's fields tend to
get violated in all possible ways.  Make it clear that low level
drivers MUST NOT access or manipulate rq->sector and rq->data_len
directly by prefixing them with double underscores.

This change is also necessary to break build of out-of-tree codes
which assume the previous block API where internal fields can be
manipulated and rq->data_len carries residual count on completion.

[ Impact: hide internal fields, block API change ]
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

a2dec7b3

block: drop request->hard_* and *nr_sectors · 2e46e8b2

由 Tejun Heo 提交于 5月 07, 2009

struct request has had a few different ways to represent some
properties of a request.  ->hard_* represent block layer's view of the
request progress (completion cursor) and the ones without the prefix
are supposed to represent the issue cursor and allowed to be updated
as necessary by the low level drivers.  The thing is that as block
layer supports partial completion, the two cursors really aren't
necessary and only cause confusion.  In addition, manual management of
request detail from low level drivers is cumbersome and error-prone at
the very least.

Another interesting duplicate fields are rq->[hard_]nr_sectors and
rq->{hard_cur|current}_nr_sectors against rq->data_len and
rq->bio->bi_size.  This is more convoluted than the hard_ case.

rq->[hard_]nr_sectors are initialized for requests with bio but
blk_rq_bytes() uses it only for !pc requests.  rq->data_len is
initialized for all request but blk_rq_bytes() uses it only for pc
requests.  This causes good amount of confusion throughout block layer
and its drivers and determining the request length has been a bit of
black magic which may or may not work depending on circumstances and
what the specific LLD is actually doing.

rq->{hard_cur|current}_nr_sectors represent the number of sectors in
the contiguous data area at the front.  This is mainly used by drivers
which transfers data by walking request segment-by-segment.  This
value always equals rq->bio->bi_size >> 9.  However, data length for
pc requests may not be multiple of 512 bytes and using this field
becomes a bit confusing.

In general, having multiple fields to represent the same property
leads only to confusion and subtle bugs.  With recent block low level
driver cleanups, no driver is accessing or manipulating these
duplicate fields directly.  Drop all the duplicates.  Now rq->sector
means the current sector, rq->data_len the current total length and
rq->bio->bi_size the current segment length.  Everything else is
defined in terms of these three and available only through accessors.

* blk_recalc_rq_sectors() is collapsed into blk_update_request() and
  now handles pc and fs requests equally other than rq->sector update.
  This means that now pc requests can use partial completion too (no
  in-kernel user yet tho).

* bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
  now uses byte count as the primary data length.

* blk_rq_pos() is now guranteed to be always correct.  In-block users
  converted.

* blk_rq_bytes() is now guaranteed to be always valid as is
  blk_rq_sectors().  In-block users converted.

* blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
  More convenient one is used.

* blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
  pointer to request.

[ Impact: API cleanup, single way to represent one property of a request ]
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

2e46e8b2

block: convert to pos and nr_sectors accessors · 83096ebf

由 Tejun Heo 提交于 5月 07, 2009

With recent cleanups, there is no place where low level driver
directly manipulates request fields.  This means that the 'hard'
request fields always equal the !hard fields.  Convert all
rq->sectors, nr_sectors and current_nr_sectors references to
accessors.

While at it, drop superflous blk_rq_pos() < 0 test in swim.c.

[ Impact: use pos and nr_sectors accessors ]
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NGeert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Tested-by: NGrant Likely <grant.likely@secretlab.ca>
Acked-by: NGrant Likely <grant.likely@secretlab.ca>
Tested-by: NAdrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: NAdrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: NMike Miller <mike.miller@hp.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Dario Ballabio <ballabio_dario@emc.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: unsik Kim <donari75@gmail.com>
Cc: Laurent Vivier <Laurent@lvivier.info>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

83096ebf

24 4月, 2009 1 次提交

block: simplify I/O stat accounting · 42dad764

由 Jerome Marchand 提交于 4月 22, 2009

This simplifies I/O stat accounting switching code and separates it
completely from I/O scheduler switch code.

Requests are accounted according to the state of their request queue
at the time of the request allocation. There is no need anymore to
flush the request queue when switching I/O accounting state.
Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

42dad764

07 4月, 2009 1 次提交

block: fix inconsistency in I/O stat accounting code · 26308eab

由 Jerome Marchand 提交于 3月 27, 2009

This forces in_flight to be zero when turning off or on the I/O stat
accounting and stops updating I/O stats in attempt_merge() when
accounting is turned off.
Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

26308eab

26 3月, 2009 1 次提交

block: WARN in __blk_put_request() for potential bio leak · 1cd96c24

由 Boaz Harrosh 提交于 3月 24, 2009

Put a WARN_ON in __blk_put_request if it is about to
leak bio(s). This is a serious bug that can happen in error
handling code paths.

For this to work I have fixed a couple of places in block/ where
request->bio != NULL ownership was not honored. And a small cleanup
at sg_io() while at it.
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

1cd96c24

06 3月, 2009 1 次提交

block: fix missing bio back/front segment size setting in blk_recount_segments() · 59247eae

由 Jens Axboe 提交于 3月 06, 2009

Commit 1e428079 introduced a bug where we
don't get front/back segment sizes in the bio in blk_recount_segments().
Fix this by tracking the back bio as well as the front bio in
__blk_recalc_rq_segments(), this also cleans up the interface by getting
rid of the segment size pointer passing.
Tested-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

59247eae

26 2月, 2009 1 次提交

block: reduce stack footprint of blk_recount_segments() · 1e428079

由 Jens Axboe 提交于 2月 23, 2009

blk_recalc_rq_segments() requires a request structure passed in, which
we don't have from blk_recount_segments(). So the latter allocates one on
the stack, using > 400 bytes of stack for that. This can cause us to spill
over one page of stack from ext4 at least:

 0)     4560     400   blk_recount_segments+0x43/0x62
 1)     4160      32   bio_phys_segments+0x1c/0x24
 2)     4128      32   blk_rq_bio_prep+0x2a/0xf9
 3)     4096      32   init_request_from_bio+0xf9/0xfe
 4)     4064     112   __make_request+0x33c/0x3f6
 5)     3952     144   generic_make_request+0x2d1/0x321
 6)     3808      64   submit_bio+0xb9/0xc3
 7)     3744      48   submit_bh+0xea/0x10e
 8)     3696     368   ext4_mb_init_cache+0x257/0xa6a [ext4]
 9)     3328     288   ext4_mb_regular_allocator+0x421/0xcd9 [ext4]
10)     3040     160   ext4_mb_new_blocks+0x211/0x4b4 [ext4]
11)     2880     336   ext4_ext_get_blocks+0xb61/0xd45 [ext4]
12)     2544      96   ext4_get_blocks_wrap+0xf2/0x200 [ext4]
13)     2448      80   ext4_da_get_block_write+0x6e/0x16b [ext4]
14)     2368     352   mpage_da_map_blocks+0x7e/0x4b3 [ext4]
15)     2016     352   ext4_da_writepages+0x2ce/0x43c [ext4]
16)     1664      32   do_writepages+0x2d/0x3c
17)     1632     144   __writeback_single_inode+0x162/0x2cd
18)     1488      96   generic_sync_sb_inodes+0x1e3/0x32b
19)     1392      16   sync_sb_inodes+0xe/0x10
20)     1376      48   writeback_inodes+0x69/0xb3
21)     1328     208   balance_dirty_pages_ratelimited_nr+0x187/0x2f9
22)     1120     224   generic_file_buffered_write+0x1d4/0x2c4
23)      896     176   __generic_file_aio_write_nolock+0x35f/0x393
24)      720      80   generic_file_aio_write+0x6c/0xc8
25)      640      80   ext4_file_write+0xa9/0x137 [ext4]
26)      560     320   do_sync_write+0xf0/0x137
27)      240      48   vfs_write+0xb3/0x13c
28)      192      64   sys_write+0x4c/0x74
29)      128     128   system_call_fastpath+0x16/0x1b

Split the segment counting out into a __blk_recalc_rq_segments() helper
to avoid allocating an onstack request just for checking the physical
segment count.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

1e428079

06 11月, 2008 1 次提交

block: remove unused ll_new_mergeable() · 43381785

由 FUJITA Tomonori 提交于 10月 20, 2008

Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

43381785

17 10月, 2008 1 次提交

block: fix nr_phys_segments miscalculation bug · 86771427

由 FUJITA Tomonori 提交于 10月 13, 2008

This fixes the bug reported by Nikanth Karthikesan <knikanth@suse.de>:

http://lkml.org/lkml/2008/10/2/203

The root cause of the bug is that blk_phys_contig_segment
miscalculates q->max_segment_size.

blk_phys_contig_segment checks:

req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size

But blk_recalc_rq_segments might expect that req->biotail and the
previous bio in the req are supposed be merged into one
segment. blk_recalc_rq_segments might also expect that next_req->bio
and the next bio in the next_req are supposed be merged into one
segment. In such case, we merge two requests that can't be merged
here. Later, blk_rq_map_sg gives more segments than it should.

We need to keep track of segment size in blk_recalc_rq_segments and
use it to see if two requests can be merged. This patch implements it
in the similar way that we used to do for hw merging (virtual
merging).
Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

86771427

09 10月, 2008 8 次提交

block: inherit CPU completion on bio->rq and rq->rq merges · ab780f1e

由 Jens Axboe 提交于 8月 26, 2008

Somewhat incomplete, as we do allow merges of requests and bios
that have different completion CPUs given. This is done on the
assumption that a larger IO is still more beneficial than CPU
locality.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

ab780f1e

block: move stats from disk to part0 · 074a7aca

由 Tejun Heo 提交于 8月 25, 2008

Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...

* part_stat_*() now updates part0 together if the specified partition
  is not part0.  ie. part_stat_*() are now essentially all_stat_*().

* {disk|all}_stat_*() are gone.

* part_round_stats() is updated similary.  It handles part0 stats
  automatically and disk_round_stats() is killed.

* part_{inc|dec}_in_fligh() is implemented which automatically updates
  part0 stats for parts other than part0.

* disk_map_sector_rcu() is updated to return part0 if no part matches.
  Combined with the above changes, this makes NULL special case
  handling in callers unnecessary.

* Separate stats show code paths for disk are collapsed into part
  stats show code paths.

* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()

While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

074a7aca

block: fix diskstats access · c9959059

由 Tejun Heo 提交于 8月 25, 2008

There are two variants of stat functions - ones prefixed with double
underbars which don't care about preemption and ones without which
disable preemption before manipulating per-cpu counters.  It's unclear
whether the underbarred ones assume that preemtion is disabled on
entry as some callers don't do that.

This patch unifies diskstats access by implementing disk_stat_lock()
and disk_stat_unlock() which take care of both RCU (for partition
access) and preemption (for per-cpu counter access).  diskstats access
should always be enclosed between the two functions.  As such, there's
no need for the versions which disables preemption.  They're removed
and double underbars ones are renamed to drop the underbars.  As an
extra argument is added, there's no danger of using the old version
unconverted.

disk_stat_lock() uses get_cpu() and returns the cpu index and all
diskstat functions which access per-cpu counters now has @cpu
argument to help RT.

This change adds RCU or preemption operations at some places but also
collapses several preemption ops into one at others.  Overall, the
performance difference should be negligible as all involved ops are
very lightweight per-cpu ones.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

c9959059

block: fix disk->part[] dereferencing race · e71bf0d0

由 Tejun Heo 提交于 9月 03, 2008

disk->part[] is protected by its matching bdev's lock.  However,
non-critical accesses like collecting stats and printing out sysfs and
proc information used to be performed without any locking.  As
partitions can come and go dynamically, partitions can go away
underneath those non-critical accesses.  As some of those accesses are
writes, this theoretically can lead to silent corruption.

This patch fixes the race by using RCU for the partition array and dev
reference counter to hold partitions.

* Rename disk->part[] to disk->__part[] to make sure no one outside
  genhd layer proper accesses it directly.

* Use RCU for disk->__part[] dereferencing.

* Implement disk_{get|put}_part() which can be used to get and put
  partitions from gendisk respectively.

* Iterators are implemented to help iterate through all partitions
  safely.

* Functions which require RCU readlock are marked with _rcu suffix.

* Use disk_put_part() in __blkdev_put() instead of directly putting
  the contained kobject.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

e71bf0d0

block: misc updates · 310a2c10

由 Tejun Heo 提交于 8月 25, 2008

This patch makes the following misc updates in preparation for
disk->part dereference fix and extended block devt support.

* implment part_to_disk()

* fix comment about gendisk->part indexing

* rename get_part() to disk_map_sector()

* don't use n which is always zero while printing disk information in
  diskstats_show()
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

310a2c10

drop vmerge accounting · 5df97b91

由 Mikulas Patocka 提交于 8月 15, 2008

Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

5df97b91

block: drop virtual merging accounting · b8b3e16c

由 Mikulas Patocka 提交于 8月 15, 2008

Remove virtual merge accounting.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

b8b3e16c

Allow elevators to sort/merge discard requests · e17fc0a1

由 David Woodhouse 提交于 8月 09, 2008

But blkdev_issue_discard() still emits requests which are interpreted as
soft barriers, because naïve callers might otherwise issue subsequent
writes to those same sectors, which might cross on the queue (if they're
reallocated quickly enough).

Callers still _can_ issue non-barrier discard requests, but they have to
take care of queue ordering for themselves.
Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

e17fc0a1

03 7月, 2008 1 次提交

block: Block layer data integrity support · 7ba1ba12

由 Martin K. Petersen 提交于 6月 30, 2008

Some block devices support verifying the integrity of requests by way
of checksums or other protection information that is submitted along
with the I/O.

This patch implements support for generating and verifying integrity
metadata, as well as correctly merging, splitting and cloning bios and
requests that have this extra information attached.

See Documentation/block/data-integrity.txt for more information.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

7ba1ba12

07 5月, 2008 1 次提交

block: get rid of likely/unlikely predictions in merge logic · 2cdf79ca

由 Jens Axboe 提交于 5月 07, 2008

They tend to depend a lot on the workload, so not a clear-cut
likely or unlikely fit.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

2cdf79ca

29 4月, 2008 1 次提交

block: make queue flags non-atomic · 75ad23bc

由 Nick Piggin 提交于 4月 29, 2008

We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

75ad23bc

21 4月, 2008 1 次提交

block: move the padding adjustment to blk_rq_map_sg · f18573ab

由 FUJITA Tomonori 提交于 4月 11, 2008

blk_rq_map_user adjusts bi_size of the last bio. It breaks the rule
that req->data_len (the true data length) is equal to sum(bio). It
broke the scsi command completion code.

commit e97a294e was introduced to fix
the above issue. However, the partial completion code doesn't work
with it. The commit is also a layer violation (scsi mid-layer should
not know about the block layer's padding).

This patch moves the padding adjustment to blk_rq_map_sg (suggested by
James). The padding works like the drain buffer. This patch breaks the
rule that req->data_len is equal to sum(sg), however, the drain buffer
already broke it. So this patch just restores the rule that
req->data_len is equal to sub(bio) without breaking anything new.

Now when a low level driver needs padding, blk_rq_map_user and
blk_rq_map_user_iov guarantee there's enough room for padding.
blk_rq_map_sg can safely extend the last entry of a scatter list.

blk_rq_map_sg must extend the last entry of a scatter list only for a
request that got through bio_copy_user_iov. This patches introduces
new REQ_COPY_USER flag.
Signed-off-by: NFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

f18573ab