1. 18 Jun 2009, 1 commit
  2. 16 Jun 2009, 21 commits
  3. 10 Jun 2009, 1 commit
    • tracing/events: convert block trace points to TRACE_EVENT() · 55782138
      Authored by Li Zefan
      TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
      these new capabilities to this tracepoint:
      
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        ...
      
      Cons:
      
        - no dev_t info for the output of plug, unplug_timer and unplug_io events.
          no dev_t info for getrq and sleeprq events if bio == NULL.
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
      
          This is mainly because we can't get the device from a request queue.
          But this may change in the future.
      
        - A packet command is converted to a string in TP_assign, not TP_print,
          while blktrace does the conversion just before output.
      
          Since pc requests should be rather rare, this is not a big issue.
      
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a single format, which means we have some unused data in a trace entry.

          The overhead is minimized by using __dynamic_array() instead of __array();
          a sketch of such a definition follows below.
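
      To make the above concrete, here is a minimal sketch of what such a
      TRACE_EVENT() definition can look like, with __dynamic_array() holding the
      hex dump of a packet command.  The field layout and the blk_cmd_buf_len() /
      blk_dump_cmd() helpers are illustrative assumptions, not a copy of
      include/trace/events/block.h:

      /* Illustrative sketch only -- not the exact in-tree definition. */
      TRACE_EVENT(block_rq_complete_sketch,

              TP_PROTO(struct request_queue *q, struct request *rq),

              TP_ARGS(q, rq),

              TP_STRUCT__entry(
                      __field(dev_t,          dev)
                      __field(sector_t,       sector)
                      __field(unsigned int,   nr_sector)
                      __array(char,           rwbs, 6)
                      /* sized per event: tiny for fs requests, larger for pc */
                      __dynamic_array(char,   cmd, blk_cmd_buf_len(rq))
              ),

              TP_fast_assign(
                      __entry->dev       = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
                      __entry->sector    = blk_rq_pos(rq);
                      __entry->nr_sector = blk_rq_sectors(rq);
                      blk_fill_rwbs_rq(__entry->rwbs, rq);
                      /* the string is built at assign time, not at print time */
                      blk_dump_cmd((char *)__get_dynamic_array(cmd), rq);
              ),

              TP_printk("%d,%d %s (%s) %llu + %u",
                        MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
                        (char *)__get_dynamic_array(cmd),
                        (unsigned long long)__entry->sector, __entry->nr_sector)
      );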
      
      I've benchmarked the ioctl blktrace vs the splice-based TRACE_EVENT tracing:
      
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      
      So the overhead of tracing is very small, and there is no regression when
      using these trace events instead of blktrace.
      
      And the binary output of TRACE_EVENT is much smaller than that of blktrace:
      
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
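
      A splice-based capture of these events can be driven by a small user-space
      reader like the sketch below (not necessarily the tool used for the
      benchmark above).  It assumes debugfs is mounted on /debug as in the
      examples here, that the block events are enabled via events/block/enable,
      and that each CPU exposes per_cpu/cpuN/trace_pipe_raw; only CPU 0 is
      handled and error handling is trimmed:

      /* Sketch: enable block events, then splice the raw per-cpu trace pages
       * straight into an output file without copying them through user space. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
              int enable = open("/debug/tracing/events/block/enable", O_WRONLY);
              if (enable >= 0) {
                      write(enable, "1", 1);
                      close(enable);
              }

              int in  = open("/debug/tracing/per_cpu/cpu0/trace_pipe_raw", O_RDONLY);
              int out = open("trace_splice.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
              int pfd[2];

              if (in < 0 || out < 0 || pipe(pfd) < 0) {
                      perror("setup");
                      return 1;
              }

              for (;;) {
                      /* kernel buffer -> pipe -> file, page by page */
                      ssize_t n = splice(in, NULL, pfd[1], NULL, 65536, SPLICE_F_MOVE);
                      if (n <= 0)
                              break;
                      splice(pfd[0], NULL, out, NULL, n, SPLICE_F_MOVE);
              }
              return 0;
      }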
      
      Following are some comparisons between TRACE_EVENT and blktrace:
      
      plug:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
      
      unplug_io:
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
      
      remap:
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
      
      bio_backmerge:
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
      
      getrq:
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
      
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
      
      rq_complete:
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
      
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
      
      rq_insert:
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      
      Changelog from v2 -> v3:
      
      - use the newly introduced __dynamic_array().
      
      Changelog from v1 -> v2:
      
      - use __string() instead of __array() to minimize the memory required
        to store hex dump of rq->cmd().
      
      - support large pc requests.
      
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      
      - some cleanups.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  4. 09 Jun 2009, 4 commits
    • md/raid5: fix bug in reshape code when chunk_size decreases. · 0e6e0271
      Authored by NeilBrown
      Now that we support changing the chunk size, we calculate
      "reshape_sectors" to be the maximum of the number of sectors in the old
      and the new chunk size.
      However there is one place where we still use 'chunksize'
      rather than 'reshape_sectors'.
      This causes a reshape that reduces the size of chunks to freeze.
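
      As an illustration of the pattern being fixed (a sketch of the idea only;
      the names and the loop below are not the actual raid5.c code):

      typedef unsigned long long sector_t;

      void reshape_one_window(sector_t sector, sector_t len);   /* hypothetical */

      static void reshape_loop(sector_t start, sector_t end,
                               sector_t old_chunk_sectors,
                               sector_t new_chunk_sectors)
      {
              /* The reshape must step by the larger of the two chunk sizes. */
              sector_t reshape_sectors = old_chunk_sectors > new_chunk_sectors
                                       ? old_chunk_sectors : new_chunk_sectors;
              sector_t sector;

              for (sector = start; sector < end; sector += reshape_sectors) {
                      /* Any leftover use of the old chunk size in this
                       * bookkeeping can stall the reshape once the new chunk
                       * is smaller; that is the kind of bug this patch fixes. */
                      reshape_one_window(sector, reshape_sectors);
              }
      }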
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5 - avoid deadlocks in get_active_stripe during reshape · a8c906ca
      Authored by NeilBrown
      md has functionality to 'quiesce' an array so that all pending
      IO completes and no new IO starts.  This is used to achieve a
      stable state before making internal changes.
      
      Currently this quiescing applies equally to normal IO, resync
      IO, and reshape IO.
      However there is a problem with applying it to reshape IO.
      Reshape can have multiple 'stripe_heads' that must be active together.
      If the quiesce comes between allocating the first and the last of
      such a collection, then we deadlock, as the last will not be allocated
      until the quiesce is lifted, the quiesce will not be lifted until the
      first (which has been allocated) gets used, and that first cannot be
      used until the last is allocated.
      
      It is not necessary to inhibit reshape IO when a quiesce is
      requested.  Those places in the code that require a full quiesce will
      ensure the reshape thread is not running at all.
      
      So allow reshape requests to get access to new stripe_heads without
      being blocked by a 'quiesce'.
      
      This only affects in-place reshapes (i.e. where the array does not
      grow or shrink) and these are only newly supported.  So this patch is
      not needed in earlier kernels.
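
      A sketch of the idea (the names below are illustrative, not the actual
      raid5.c code): stripe allocation normally waits for a quiesce to be lifted,
      and the fix is to let callers doing reshape bypass that wait, since the
      quiesce itself cannot complete until their whole collection of stripe_heads
      has been allocated:

      static struct stripe_head *get_stripe_sketch(struct raid5_private *conf,
                                                   sector_t sector, int for_reshape)
      {
              /* normal IO waits out the quiesce; reshape IO must not */
              wait_event(conf->wait_for_stripe,
                         conf->quiesce == 0 || for_reshape);

              return find_or_alloc_stripe(conf, sector);   /* hypothetical helper */
      }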
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: use conf->raid_disks in preference to mddev->raid_disk · f001a70c
      Authored by NeilBrown
      mddev->raid_disks can be changed at any time by a request from
      user-space.  It is a suggestion as to what number of raid_disks is
      desired.
      
      conf->raid_disks can only be changed by the raid5 module with suitable
      locks in place.  It is a statement as to the current number of
      raid_disks.
      
      There are two places where the latter should be used, but the former
      is used instead.  This can lead to a crash when reshaping an array.

      This patch changes those uses of mddev-> to conf->.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Revert "block: Fix bounce limit setting in DM" · 9df1bb9b
      Authored by Jens Axboe
      This reverts commit a05c0205.
      
      DM doesn't need to access the bounce_pfn directly.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  5. 03 Jun 2009, 1 commit
  6. 27 May 2009, 1 commit
  7. 26 May 2009, 7 commits
  8. 23 May 2009, 2 commits
  9. 07 May 2009, 2 commits
    • md: remove rd%d links immediately after stopping an array. · c4647292
      Authored by NeilBrown
      md maintains links in sys/mdXX/md/ to identify which device has
      which role in the array, e.g.
         rd2 -> dev-sda
      
      indicates that the device with role '2' in the array is sda.
      
      These links are only present when the array is active.  They are
      created immediately after ->run is called, and so should be removed
      immediately after ->stop is called.
      However they are currently removed a little bit later, and it is
      possible for ->run to be called again, thus adding these links, before
      they are removed.
      
      So move the removal earlier so they are consistently only present when
      the array is active.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove ability to explicitly set an inactive array to 'clean'. · 5bf29597
      Authored by NeilBrown
      Being able to write 'clean' to an 'array_state' of an inactive array
      to activate it in 'clean' mode is both unnecessary and inconvenient.
      
      It is unnecessary because the same can be achieved by writing
      'active'.  This activates an array, but it still remains 'clean'
      until the first write.
      
      It is inconvenient because writing 'clean' is more often used to
      cause an 'active' array to revert to 'clean' mode (thus blocking
      any writes until a 'write-pending' is promoted to 'active').
      
      Allowing 'clean' to both activate an array and mark an active array as
      clean can lead to races:  One program writes 'clean' to mark the
      active array as clean at the same time as another program writes
      'inactive' to deactivate (stop) an active array.  Depending on which
      write lands first, the array could be deactivated and immediately
      reactivated, which isn't what was desired.
      
      So just disable the use of 'clean' to activate an array.
      
      This avoids a race that can be triggered with mdadm-3.0 and external
      metadata, so it is suitable for -stable.
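
      For context, a small user-space sketch of the intended usage after this
      change (the sysfs path is the usual md one but should be treated as an
      assumption, and error handling is trimmed): write 'active' to bring up an
      inactive array, which then stays clean until the first write, and use
      'clean' only to revert an array that is already active:

      #include <fcntl.h>
      #include <string.h>
      #include <unistd.h>

      static void write_array_state(const char *state)
      {
              int fd = open("/sys/block/md0/md/array_state", O_WRONLY);
              if (fd >= 0) {
                      write(fd, state, strlen(state));
                      close(fd);
              }
      }

      int main(void)
      {
              write_array_state("active");  /* activate; the array remains clean */
              /* ... later, on the active array ... */
              write_array_state("clean");   /* block writes until write-pending -> active */
              return 0;
      }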
      Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>