Commits · e71bf0d0ee89e51b92776391c5634938236977d5 · openeuler / Kernel

09 Oct, 2008 39 commits

block: fix disk->part[] dereferencing race · e71bf0d0

Committed by Tejun Heo on Sep 03, 2008

disk->part[] is protected by its matching bdev's lock.  However,
non-critical accesses like collecting stats and printing out sysfs and
proc information used to be performed without any locking.  As
partitions can come and go dynamically, partitions can go away
underneath those non-critical accesses.  As some of those accesses are
writes, this theoretically can lead to silent corruption.

This patch fixes the race by using RCU for the partition array and dev
reference counter to hold partitions.

* Rename disk->part[] to disk->__part[] to make sure no one outside
  genhd layer proper accesses it directly.

* Use RCU for disk->__part[] dereferencing.

* Implement disk_{get|put}_part() which can be used to get and put
  partitions from gendisk respectively.

* Iterators are implemented to help iterate through all partitions
  safely.

* Functions which require the RCU read lock are marked with an _rcu suffix.

* Use disk_put_part() in __blkdev_put() instead of directly putting
  the contained kobject.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: don't depend on consecutive minor space · f331c029

Committed by Tejun Heo on Sep 03, 2008

* Implement disk_devt() and part_devt() and use them to directly
  access devt instead of computing it from ->major and ->first_minor.

  Note that all references to ->major and ->first_minor outside of
  the block layer are used to determine the devt of the disk (the
  part0), and as ->major and ->first_minor will continue to represent
  devt for the disk, converting these users isn't strictly necessary.
  However, convert them for consistency.

* Implement disk_max_parts() to avoid directly dereferencing
  genhd->minors.

* Update bdget_disk() such that it doesn't assume consecutive minor
  space.

* Move devt computation from register_disk() to add_disk() and make it
  the only one (all other usages use the initially determined value).

These changes clean up the code and will help disk->part dereference
fix and extended block device numbers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: make variable and argument names more consistent · cf771cb5

Committed by Tejun Heo on Sep 03, 2008

In hd_struct, @partno is used to denote partition number and a number
of other places use @part to denote hd_struct.  Functions use @part
and @index instead.  This causes confusion and makes it difficult to
use consistent variable names for hd_struct.  Always use @partno if a
variable represents partition number.

Also, print out functions use @f or @part for seq_file argument.  Use
@seqf uniformly instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: misc updates · 310a2c10

Committed by Tejun Heo on Aug 25, 2008

This patch makes the following misc updates in preparation for
disk->part dereference fix and extended block devt support.

* implement part_to_disk()

* fix comment about gendisk->part indexing

* rename get_part() to disk_map_sector()

* don't use n which is always zero while printing disk information in
  diskstats_show()
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: update add_partition() error handling · 88e34126

Committed by Tejun Heo on Aug 25, 2008

d805dda4 tried to fix error case handling in add_partition() but had a
few problems.

* disk->part[] entry is set early and left dangling if operation
  fails.

* Once the device is initialized, the last put_device() is responsible
  for freeing all the resources.  The failure path freed part_stats and
  p regardless of put_device(), causing a double free.

* The holders subdir holds a reference to the disk device, so the
  failure path should remove it to release resources properly; that
  removal was missing.

This patch fixes the above problems and, while at it, moves the
partition slot busy check into add_partition() for completeness and
inlines holders subdirectory creation.  Using a separate function for
it just obfuscates the code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Abdel Benamrouche <draconux@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: allow deleting zero length partition · ec2cdedf

Committed by Tejun Heo on Aug 25, 2008

delete_partition() was a no-op for zero length partitions.  As the
addition code allows creating zero length partitions and deletion is
assumed to always succeed, this causes a memory leak for zero length
partitions.  Allow zero length partitions to end their meaningless
lives.

While at it, allow deleting zero length partitions via the
BLKPG_DEL_PARTITION ioctl too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: use class_dev_iterator instead of class_for_each_device() · def4e38d

Committed by Tejun Heo on Sep 03, 2008

Recent block_class iteration updates 5c6f35c5..27f30251 converted all
class device iteration to class_for_each_device() and
class_find_device(), which are correct but a pain in the ass to use.
This patch converts them to newly introduced class_dev_iterator so that
they can use more natural control structures instead of separate
callbacks and struct to pass parameters to them.

This results in smaller and easier code.

This patch also restores the original behavior of not printing header
in /proc/partitions if there's no partition to print.  This is trivial
but still user-visible behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
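
The difference between callback-driven iteration and an explicit iterator, as described in the commit above, can be sketched in userspace C. This is only an illustration under stated assumptions: `struct dev`, `for_each_dev()`, `dev_iter_init()` and `dev_iter_next()` are hypothetical names over a plain array, not the driver-core API, and no locking or reference counting is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a class device. */
struct dev {
    const char *name;
    int nr_parts;
};

/* Callback style: the caller's state travels through a void pointer. */
typedef int (*dev_fn)(struct dev *d, void *data);

static int for_each_dev(struct dev *devs, size_t n, void *data, dev_fn fn)
{
    for (size_t i = 0; i < n; i++) {
        int ret = fn(&devs[i], data);
        if (ret)
            return ret;     /* a non-zero return stops the walk */
    }
    return 0;
}

static int add_parts_cb(struct dev *d, void *data)
{
    *(int *)data += d->nr_parts;
    return 0;
}

static int total_parts_callback(struct dev *devs, size_t n)
{
    int total = 0;

    for_each_dev(devs, n, &total, add_parts_cb);
    return total;
}

/* Iterator style: the caller owns the loop and keeps its state in locals. */
struct dev_iter {
    struct dev *devs;
    size_t n;
    size_t pos;
};

static void dev_iter_init(struct dev_iter *it, struct dev *devs, size_t n)
{
    assert(devs != NULL || n == 0);
    it->devs = devs;
    it->n = n;
    it->pos = 0;
}

static struct dev *dev_iter_next(struct dev_iter *it)
{
    return it->pos < it->n ? &it->devs[it->pos++] : NULL;
}

static int total_parts_iter(struct dev *devs, size_t n)
{
    struct dev_iter it;
    struct dev *d;
    int total = 0;

    dev_iter_init(&it, devs, n);
    while ((d = dev_iter_next(&it)) != NULL)
        total += d->nr_parts;
    return total;
}
```

Both functions compute the same sum; the iterator version needs no separate callback and no parameter-passing struct, which is the simplification the commit message points at.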

block: don't grab block_class_lock unnecessarily · 2ac3cee5

Committed by Tejun Heo on Sep 03, 2008

block_class_lock protects major_names array and bdev_map and doesn't
have anything to do with block class devices.  Don't grab it while
iterating over block class devices.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: fix partition info printouts · ac65ece4

Committed by Tejun Heo on Aug 25, 2008

Recent block_class iteration updates 5c6f35c5..27f30251 broke partition
info printouts.

* printk_all_partitions(): Partition print out stops when it meets a
  partition hole.  Partition printing inner loop should continue
  instead of exiting on empty partition slot.

* /proc/partitions and /proc/diskstats: If all information can't be
  read in single read(), the information is truncated.  This is
  because find_start() doesn't actually update the counter containing
  the initial seek.  It runs to the end and ends up always reporting
  EOF on the second read.

This patch fixes both problems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
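
The partition-hole problem from the first bullet above comes down to `continue` versus `break` on an empty slot. The following is a minimal userspace sketch of that difference, not the kernel's printk_all_partitions(); `struct part` and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a partition entry. */
struct part {
    int partno;
};

/* Correct walk: an empty slot is a hole, so skip it and keep scanning. */
static size_t count_parts(struct part *const slots[], size_t nr_slots)
{
    size_t found = 0;

    assert(slots != NULL || nr_slots == 0);
    for (size_t i = 0; i < nr_slots; i++) {
        if (!slots[i])
            continue;
        found++;
    }
    return found;
}

/* Buggy walk, for contrast: stops at the first hole and misses later slots. */
static size_t count_parts_until_hole(struct part *const slots[], size_t nr_slots)
{
    size_t found = 0;

    for (size_t i = 0; i < nr_slots; i++) {
        if (!slots[i])
            break;
        found++;
    }
    return found;
}
```

With slots `{p1, NULL, p3}`, the first function sees both partitions while the second stops after `p1`.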

driver-core: use klist for class device list and implement iterator · 5a3ceb86

Committed by Tejun Heo on Aug 25, 2008

Iterating over entries using callback usually isn't too fun especially
when the entry being iterated over can't be manipulated freely.  This
patch converts class->p->class_devices to klist and implements class
device iterator so that the users can freely build their own control
structure.  The users are also free to call back into class code
without worrying about locking.

class_for_each_device() and class_find_device() are converted to use
the new iterators, so their users don't have to worry about locking
anymore either.

Note: This depends on klist-dont-iterate-over-deleted-entries patch
because class_intf->add/remove_dev() depends on proper synchronization
with device removal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

klist: don't iterate over deleted entries · a1ed5b0c

Committed by Tejun Heo on Aug 25, 2008

A klist entry is kept on the list till all its current iterations are
finished; however, a new iteration after deletion also iterates over
deleted entries as long as their reference count stays above zero.
This causes problems for cases where there are users which iterate
over the list while synchronized against list manipulations and
naturally expect already deleted entries to not show up during
iteration.

This patch implements a dead flag which gets set on deletion so that
iteration can skip already deleted entries.  The dead flag piggybacks
on the lowest bit of knode->n_klist and is only visible to the klist
implementation proper.

While at it, drop klist_iter->i_head as it's redundant and offers
nothing in terms of semantics or performance, as klist_iter->i_klist
is dereferenced on every iteration anyway.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
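
Storing a flag in the lowest bit of a pointer, as the commit above describes for the dead flag, can be illustrated in userspace C. This is a minimal sketch and not the kernel's klist code: `struct list`, `struct node` and the `node_*` helpers are hypothetical, and it assumes the pointed-to object is at least 2-byte aligned so that bit 0 is always free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_DEAD 1UL   /* flag kept in bit 0 of the owner pointer */

/* Hypothetical stand-in for the list a node belongs to. */
struct list {
    int id;
};

struct node {
    uintptr_t owner;    /* struct list * with the dead flag in bit 0 */
};

static void node_init(struct node *n, struct list *l)
{
    /* Alignment keeps bit 0 clear, which leaves it free for the flag. */
    assert(((uintptr_t)l & NODE_DEAD) == 0);
    n->owner = (uintptr_t)l;
}

/* Recover the real pointer by masking the flag bit off. */
static struct list *node_owner(const struct node *n)
{
    return (struct list *)(n->owner & ~(uintptr_t)NODE_DEAD);
}

static int node_is_dead(const struct node *n)
{
    return (n->owner & NODE_DEAD) != 0;
}

static void node_mark_dead(struct node *n)
{
    n->owner |= NODE_DEAD;
}

/* Iteration skips entries that have already been marked dead. */
static size_t count_live(const struct node *nodes, size_t len)
{
    size_t live = 0;

    for (size_t i = 0; i < len; i++) {
        if (node_is_dead(&nodes[i]))
            continue;
        live++;
    }
    return live;
}
```

Because the flag lives inside an existing field, the node does not grow, and only code that goes through the accessors ever sees the tagged value.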

Add some block/ source files to the kernel-api docbook. Fix kernel-doc... · 710027a4

Committed by Randy Dunlap on Aug 19, 2008

Add some block/ source files to the kernel-api docbook. Fix kernel-doc notation in them as needed. Fix changed function parameter names. Fix typos/spellos. In comments, change REQ_SPECIAL to REQ_TYPE_SPECIAL and REQ_BLOCK_PC to REQ_TYPE_BLOCK_PC.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: make bi_phys_segments an unsigned int instead of short · 5b99c2ff

Committed by Jens Axboe on Aug 15, 2008

raid5 can overflow with more than 255 stripes, and we can increase it
to an int for free on both 32 and 64-bit archs due to the padding.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: raid fixups for removal of bi_hw_segments · 960e739d

Committed by Jens Axboe on Aug 15, 2008

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

drop vmerge accounting · 5df97b91

Committed by Mikulas Patocka on Aug 15, 2008

Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: drop virtual merging accounting · b8b3e16c

Committed by Mikulas Patocka on Aug 15, 2008

Remove virtual merge accounting.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: update documentation for deadline fifo_batch tunable · 6a421c1d

Committed by Aaron Carroll on Aug 14, 2008

Update the description of fifo_batch to match the current implementation,
and include a description of how to tune it.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

deadline-iosched: non-functional fixes · 4fb72f76

Committed by Aaron Carroll on Aug 14, 2008

* convert goto to simpler while loop;
* use rq_end_sector() instead of computing manually;
* fix false comments;
* remove spurious whitespace;
* convert rq_rb_root macro to an inline function.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

deadline-iosched: allow non-sequential batching · 63de428b

Committed by Aaron Carroll on Aug 14, 2008

Deadline currently only batches sector-contiguous requests, so except
for a few circumstances (e.g. requests in a single direction), it is
essentially first come first served. This is bad for throughput, so
change it to CSCAN, which means requests in a batch do not need to be
sequential and are issued in increasing sector order.
Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
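
The CSCAN ordering mentioned in the commit above (serve requests in increasing sector order, then wrap around to the lowest sector) can be sketched as a selection function. This is an illustrative userspace sketch only: `cscan_next()` is a hypothetical name working on a plain array, and it models neither the scheduler's sorted tree nor its FIFO expiry and batching limits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the index of the next request in C-SCAN order: the smallest sector
 * at or after `pos`, or, if nothing lies ahead, the smallest sector overall.
 * Returns -1 when there are no pending requests.
 */
static ptrdiff_t cscan_next(const uint64_t *sectors, size_t n, uint64_t pos)
{
    ptrdiff_t ahead = -1;   /* best candidate at or after pos */
    ptrdiff_t lowest = -1;  /* lowest sector overall, used when wrapping */

    assert(sectors != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (lowest < 0 || sectors[i] < sectors[lowest])
            lowest = (ptrdiff_t)i;
        if (sectors[i] >= pos && (ahead < 0 || sectors[i] < sectors[ahead]))
            ahead = (ptrdiff_t)i;
    }
    return ahead >= 0 ? ahead : lowest;
}
```

Requests in a batch need not be contiguous: with pending sectors 900, 100 and 500 and the head at 400, the order is 500, 900, then a wrap to 100.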

virtio_blk: use a wrapper function to access io context information of IO requests · 766ca442

Committed by Fernando Luis Vázquez Cao on Aug 14, 2008

struct request has an ioprio member but it is never updated because
currently bios do not hold io context information. The implication of
this is that virtio_blk ends up passing useless information to the
backend driver.

That said, some IO schedulers such as CFQ do store io context
information in struct request, but use private members for that, which
means that that information cannot be directly accessed in an IO
scheduler-independent way.

This patch adds a function to obtain the ioprio of a request. We should
avoid accessing ioprio directly and use this function instead, so that
its users do not have to care about future changes in block layer
structures or what the currently active IO controller is.

This patch does not introduce any functional changes but paves the way
for future clean-ups and enhancements.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Kill REQ_TYPE_FLUSH · 1a8e2bdd

Committed by David Woodhouse on Aug 13, 2008

It was only used by ps3disk, and it should probably have been
REQ_TYPE_LINUX_BLOCK + REQ_LB_OP_FLUSH.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Allow elevators to sort/merge discard requests · e17fc0a1

Committed by David Woodhouse on Aug 09, 2008

But blkdev_issue_discard() still emits requests which are interpreted as
soft barriers, because naïve callers might otherwise issue subsequent
writes to those same sectors, which might cross on the queue (if they're
reallocated quickly enough).

Callers still _can_ issue non-barrier discard requests, but they have to
take care of queue ordering for themselves.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Add BLKDISCARD ioctl to allow userspace to discard sectors · d30a2605

Committed by David Woodhouse on Aug 11, 2008

We may well want mkfs tools to use this to mark the whole device as
unwanted before they format it, for example.

The ioctl takes a pair of uint64_ts, which are start offset and length
in _bytes_. Although at the moment it might make sense for them both to
be in 512-byte sectors, I don't want to limit the ABI to that.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
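
From userspace, the interface described above takes a two-element array of byte values: start offset, then length. The following is a minimal sketch of a caller, assuming Linux headers that define `BLKDISCARD` in `<linux/fs.h>`; the helper names and the 512-byte `SECTOR_SIZE` are assumptions made for this example, not part of the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKDISCARD */

#define SECTOR_SIZE 512u    /* assumed logical sector size for this example */

/* Convert a sector-based range into the {start, length} byte pair. */
static void sectors_to_byte_range(uint64_t first_sector, uint64_t nr_sectors,
                                  uint64_t range[2])
{
    /* Guard against overflow when scaling sectors to bytes. */
    assert(first_sector <= UINT64_MAX / SECTOR_SIZE);
    assert(nr_sectors <= UINT64_MAX / SECTOR_SIZE);
    range[0] = first_sector * SECTOR_SIZE;  /* start offset in bytes */
    range[1] = nr_sectors * SECTOR_SIZE;    /* length in bytes */
}

/* Returns 0 on success, or -1 with errno set by ioctl(). */
static int discard_sectors(int fd, uint64_t first_sector, uint64_t nr_sectors)
{
    uint64_t range[2];

    sectors_to_byte_range(first_sector, nr_sectors, range);
    return ioctl(fd, BLKDISCARD, range);
}
```

A caller would open the block device for writing and pass its descriptor to `discard_sectors()`. The discarded range should be treated as lost data, so this is not something to try on a device holding anything of value.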

Use WRITE_BARRIER in blkdev_issue_flush(), not (1<<BIO_RW_BARRIER) · 2ebca85a

Committed by OGAWA Hirofumi on Aug 11, 2008

Barriers should be submitted with the WRITE flag set.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

blktrace: simplify flags handling in __blk_add_trace · 35ba8f70

Committed by David Woodhouse on Aug 10, 2008

Let the compiler see what's going on, and it can all get a lot simpler.
On PPC64 this reduces the size of the code calculating these bits by
about 60%. On x86_64 it's less of a win -- only 40%.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

blktrace: support discard requests · 27b29e86

Committed by David Woodhouse on Aug 10, 2008

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Support 'discard sectors' operation. · fdc53971

Committed by David Woodhouse on Aug 05, 2008

We can benefit from knowing that the file system no longer cares about
the contents of certain sectors, by throwing them away immediately and
then never having to garbage collect them, and using the extra free
space to make our operations more efficient. Do so.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Support 'discard sectors' operation in translation layer support core · eae9acd1

Committed by David Woodhouse on Aug 05, 2008

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Let the block device know when sectors can be discarded · 8c540a96

Committed by David Woodhouse on Aug 05, 2008

[hirofumi@mail.parknet.co.jp: discard _after_ checking for corrupt chains]
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Add 'discard' request handling · fb2dce86

Committed by David Woodhouse on Aug 05, 2008

Some block devices benefit from a hint that they can forget the contents
of certain sectors. Add basic support for this to the block core, along
with a 'blkdev_issue_discard()' helper function which issues such
requests.

The caller doesn't get to provide an end_io function, since
blkdev_issue_discard() will automatically split the request up into
multiple bios if appropriate. Neither does the function wait for
completion -- it's expected that callers won't care about when, or even
_if_, the request completes. It's only a hint to the device anyway. By
definition, the file system doesn't _care_ about these sectors any more.

[With feedback from OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> and
Jens Axboe <jens.axboe@oracle.com>]
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Fix up comments about matching flags between bio and rq · d628eaef

Committed by David Woodhouse on Aug 09, 2008

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

highmem: use bio_has_data() in the bounce path · 36144077

Committed by Jens Axboe on Aug 14, 2008

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: use bio_has_data() in the IO completion path · 051cc395

Committed by Jens Axboe on Aug 08, 2008

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: use bio_has_data() to check for data carrying bio · a9c701e5

Committed by Jens Axboe on Aug 08, 2008

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

block: add bio_has_data() to detect whether a bio carries data or not · 7a67f63b

Committed by Jens Axboe on Aug 08, 2008

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

SG_IO block filter whitelist missing MMC SET READ AHEAD command · 35e396cd

Committed by xiphmont@xiph.org on Aug 22, 2008

I have another request for the block filter SG_IO command whitelist,
specifically the MMC streaming command set SET READ AHEAD command.
The command applies only to MMC CDROM/DVDROM drives with the streaming
optional feature set.  The command is useful to cdparanoia in that it
allows explicit cache control side effects that are, on many drives,
cdparanoia's most efficient way to flush/disable the media cache on
cdrom drives. I am aware of no reason why it should not be accessible
from userspace.

Also note that the command is already fully accessible through the
SCSI-native version of the SG_IO ioctl as well as the traditional SG
interface.  The command is only being refused on block devices.  That
means that on a typical stock distro, the command is available through
/dev/sg* but not /dev/scd* although both are typically available and
accessible.  Filtering the command is not providing any protection,
only a confusing inconsistency.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus · 69849375

Committed by Linus Torvalds on Oct 08, 2008

* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
  [MIPS] Sibyte: Register PIO PATA device only for Swarm and Little Sur

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · 392eaef2

Committed by Linus Torvalds on Oct 08, 2008

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  tcp: Fix tcp_hybla zero congestion window growth with small rho and large cwnd.
  net: Fix netdev_run_todo dead-lock
  tcp: Fix possible double-ack w/ user dma
  net: only invoke dev->change_rx_flags when device is UP
  netrom: Fix sock_orphan() use in nr_release
  ax25: Quick fix for making sure unaccepted sockets get destroyed.
  Revert "ax25: Fix std timer socket destroy handling."
  [Bluetooth] Add reset quirk for A-Link BlueUSB21 dongle
  [Bluetooth] Add reset quirk for new Targus and Belkin dongles
  [Bluetooth] Fix double frees on error paths of btusb and bpa10x drivers

[MIPS] Sibyte: Register PIO PATA device only for Swarm and Little Sur · 88060488

Committed by Ralf Baechle on Oct 08, 2008

Symbol name spaghetti which is too complicated to clean up at this stage
of the release cycle breaks the build on BCM1480 platforms.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

08 Oct, 2008 1 commit

tcp: Fix tcp_hybla zero congestion window growth with small rho and large cwnd. · 9d2c27e1

Committed by Daniele Lacamera on Oct 07, 2008

Because of rounding, in certain conditions, i.e. when in congestion
avoidance state rho is smaller than 1/128 of the current cwnd, TCP
Hybla congestion control starves and the cwnd is kept constant
forever.

This patch forces an increment by one segment after #send_cwnd calls
without increments (NewReno behavior).
Signed-off-by: Daniele Lacamera <root@danielinux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
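
The rounding stall described above can be reproduced with a small fixed-point model. This is a simplified sketch and not the kernel's tcp_hybla code: `struct ca_state`, `ca_on_ack()` and the `gain_7ls` field (a hypothetical growth factor scaled by 128) are assumptions for illustration. It shows integer division producing a zero increment and a fallback that adds one segment after a full window of such calls.

```c
#include <assert.h>
#include <stdint.h>

struct ca_state {
    uint32_t cwnd;      /* congestion window, in segments */
    uint32_t cwnd_cnt;  /* consecutive calls that produced no growth */
    uint32_t cents;     /* accumulated fractional growth, in 1/128 segment */
    uint32_t gain_7ls;  /* hypothetical growth factor, scaled by 128 */
};

/* Congestion-avoidance step, run once per ACK in this model. */
static void ca_on_ack(struct ca_state *s)
{
    assert(s->cwnd > 0);

    /* Integer division rounds toward zero: a large cwnd yields 0. */
    uint32_t increment = s->gain_7ls / s->cwnd;

    s->cwnd += increment >> 7;      /* whole segments */
    s->cents += increment & 127;    /* fractional part */
    if (s->cents >= 128) {
        s->cwnd++;
        s->cents -= 128;
    }

    if (increment == 0) {
        /* Fallback: after cwnd calls without growth, add one segment. */
        if (++s->cwnd_cnt >= s->cwnd) {
            s->cwnd++;
            s->cwnd_cnt = 0;
        }
    } else {
        s->cwnd_cnt = 0;
    }
}
```

With `cwnd = 1000` and `gain_7ls = 128`, every call computes an increment of zero. Without the fallback the window would never move; with it, the window grows by one segment on the 1000th call.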
