提交 · ebb351cf78ee6bf3262e6b4b6906f456c05e6d5e · openanolis / cloud-kernel

18 12月, 2012 1 次提交

llist/xen-blkfront: implement safe version of llist_for_each_entry · ebb351cf

由 Roger Pau Monne 提交于 12月 04, 2012

Implement a safe version of llist_for_each_entry, and use it in
blkif_free. Previously grants where freed while iterating the list,
which lead to dereferences when trying to fetch the next item.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NRoger Pau Monné <roger.pau@citrix.com>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
[v2: Move the llist_for_each_entry_safe in llist.h]
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ebb351cf

08 12月, 2012 1 次提交

xen-blkback: implement safe iterator for the list of persistent grants · 7dc34117

由 Roger Pau Monne 提交于 12月 04, 2012

Change foreach_grant iterator to a safe version, that allows freeing
the element while iterating. Also move the free code in
free_persistent_gnts to prevent freeing the element before the rb_next
call.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NRoger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: xen-devel@lists.xen.org
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

7dc34117

27 11月, 2012 2 次提交

xen-blkfront: free allocated page · 07c540a0

由 Roger Pau Monne 提交于 11月 16, 2012

Free the page allocated for the persistent grant.
Signed-off-by: NRoger Pau Monné <roger.pau@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

07c540a0

xen-blkback: move free persistent grants code · 4d4f270f

由 Roger Pau Monne 提交于 11月 16, 2012

Move the code that frees persistent grants from the red-black tree
to a function. This will make it easier for other consumers to move
this to a common place.
Signed-off-by: NRoger Pau Monné <roger.pau@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

4d4f270f

04 11月, 2012 1 次提交

xen/blkback: persistent-grants fixes · cb5bd4d1

由 Roger Pau Monne 提交于 11月 02, 2012

This patch contains fixes for persistent grants implementation v2:

 * handle == 0 is a valid handle, so initialize grants in blkback
   setting the handle to BLKBACK_INVALID_HANDLE instead of 0. Reported
   by Konrad Rzeszutek Wilk.

 * new_map is a boolean, use "true" or "false" instead of 1 and 0.
   Reported by Konrad Rzeszutek Wilk.

 * blkfront announces the persistent-grants feature as
   feature-persistent-grants, use feature-persistent instead which is
   consistent with blkback and the public Xen headers.

 * Add a consistency check in blkfront to make sure we don't try to
   access segments that have not been set.
Reported-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: NRoger Pau Monne <roger.pau@citrix.com>
[v1: The new_map int->bool had already been changed]
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

cb5bd4d1

30 10月, 2012 1 次提交

xen/blkback: Persistent grant maps for xen blk drivers · 0a8704a5

由 Roger Pau Monne 提交于 10月 24, 2012

This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection
between blkfront and back is broken.

On startup blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same grants, since it
is compatible with the previous protocol, and simplifies the code
complexity in blkfront.

To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpy's from the
shared memory, into the bvec supplied. A record that the {gref, page}
tuple is mapped, and not inflight is kept.

Writes are similar, except that the memcpy is peformed from the
supplied bvecs, into the shared pages, before the request is put onto
the ring.

Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known apriori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page, for every gref we receive. This
operation takes O(log n) time in the worst case. In blkfront grants
are stored using a single linked list.

The maximum number of grants that blkback will persistenly map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicios guest from attempting a DoS, by supplying fresh
grefs, causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistenly the newly passed grefs will be mapped and unmaped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
setting it will lead to unpredictable performance.

In writing this patch, the question arrises as to if the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweigh the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any
additional cost to the guest of using persistent grants. There is
perhaps a small saving, from the reduced number of hypercalls
performed in granting, and ending foreign access.
Signed-off-by: NOliver Chick <oliver.chick@citrix.com>
Signed-off-by: NRoger Pau Monne <roger.pau@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up the misuse of bool as int]

0a8704a5

06 10月, 2012 21 次提交

aoe: update aoe-internal version number to 50 · 322c9ec0

由 Ed Cashin 提交于 10月 04, 2012

Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

322c9ec0

aoe: remove unused code · 1ac9e602

由 Ed Cashin 提交于 10月 04, 2012

Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1ac9e602

aoe: make dynamic block minor numbers the default · 08b60623

由 Ed Cashin 提交于 10月 04, 2012

Because udev use is so widespread, making the old static mapping the
default is too conservative, given the severe limitations it places on
usable AoE addresses.  Storage virtualization and larger shelves have made
the old limitations too confining.

These changes make the dynamic block device minor numbers the default,
removing the limitations on usable AoE addresses.

The static arrangement is still available with aoe_dyndevs=0, and the
aoe-stat tool from the userland aoetools package, the user space
counterpart to the aoe driver, recognizes the case where there is a
mismatch between the minor number in sysfs and the minor number in a
special device file.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

08b60623

aoe: update and specify AoE address guards and error messages · 7159e969

由 Ed Cashin 提交于 10月 04, 2012

In general, specific is better when it comes to messages about AoE usage
problems.  Also, explicit checks for the AoE broadcast addresses are
added.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7159e969

aoe: retain static block device numbers for backwards compatibility · 4bcce1a3

由 Ed Cashin 提交于 10月 04, 2012

The old mapping between AoE target shelf and slot addresses and the block
device minor number is retained as a backwards-compatible feature, with a
new "aoe_dyndevs" module parameter available for enabling dynamic block
device minor numbers.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4bcce1a3

aoe: support more AoE addresses with dynamic block device minor numbers · 0c966214

由 Ed Cashin 提交于 10月 04, 2012

The ATA over Ethernet protocol uses a major (shelf) and minor (slot)
address to identify a particular storage target. These changes remove an
artificial limitation the aoe driver imposes on the use of AoE addresses.
For example, without these changes, the slot address has a maximum of 15,
but users commonly use slot numbers much greater than that.

The AoE shelf and slot address space is often used sparsely. Instead of
using a static mapping between AoE addresses and the block device minor
number, the block device minor numbers are now allocated on demand.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0c966214

aoe: update copyright year in touched files · fea05a26

由 Ed Cashin 提交于 10月 04, 2012

Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fea05a26

aoe: update internal version number to 49 · 7392fbe5

由 Ed Cashin 提交于 10月 04, 2012

The internal version number of the aoe driver appears in a console message
when the driver loads and is usually obtained by the user with the
userland aoe-version tool, part of the aoetools.[1]

Although this patchset includes bugfixes backported from higher-numbered
versions published on the coraid.com website, it is a form of version 49.

1. http://aoetools.sourceforge.net/Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7392fbe5

aoe: remove unused code and add cosmetic improvements · b21faa25

由 Ed Cashin 提交于 10月 04, 2012

This change removes some unused code and attempts to increase code
consistency.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b21faa25

aoe: increase net_device reference count while using it · 1b86fda9

由 Ed Cashin 提交于 10月 04, 2012

This change eliminates the danger that the user could rmmod the driver for
a network interface that is being used for AoE by the aoe driver.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1b86fda9

aoe: associate frames with the AoE storage target · 64a80f5a

由 Ed Cashin 提交于 10月 04, 2012

In the driver code, "target" and aoetgt refer to a particular remote
interface on the AoE storage target. The latter is identified by its AoE
major and minor addresses. Commands that are being sent to an AoE storage
target {major, minor} can be sent or retransmitted to any of the remote
MAC addresses associated with the AoE storage target.

That is, frames are naturally associated with not an aoetgt (AoE major,
AoE minor, remote MAC address) but an aoedev (AoE major, AoE minor).
Making the code reflect that reality simplifies the driver, especially
when the path to a remote MAC address becomes unusable.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

64a80f5a

aoe: disallow unsupported AoE minor addresses · 6583303c

由 Ed Cashin 提交于 10月 04, 2012

A guard is inserted to prevent AoE minor addresses (slot addresses) higher
than 15 to be used, as they are not yet supported by the driver.

There is a change coming that will allow the aoe driver to overcome this
limit by using system device minor numbers dynamically, but until then,
this guard prevents unexpected targets from being used by the driver when
AoE targets with high minor numbers are on the AoE network.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6583303c

aoe: do revalidation steps in order · 25f4d75e

由 Ed Cashin 提交于 10月 04, 2012

The discovery process begins with an optional AoE config query command and
an AoE config query response. Normally when an aoe device is already
open, the config query response does not trigger an ATA identify device
command to be sent out, since the response contains storage capacity
information that, if changed, could surprise the user of the device.

The userland "aoe-revalidate" tool uses a character device to trigger an
AoE config query for a particular AoE storage target and an ATA device
identify command, even when the device is open.

This change causes the config query to go out first, reflecting the normal
discovery sequence. The responses could come back in any order, so this
change is fairly cosmetic.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

25f4d75e

aoe: failover remote interface based on aoe_deadsecs parameter · d54d35ac

由 Ed Cashin 提交于 10月 04, 2012

The aoe_deadsecs module parameter allows the user to specify a hard limit
on the number of seconds an AoE command can be retransmitted before the
AoE block device is considered to have failed.

Using aoe_deadsecs to determine the time we try using a different remote
interface helps to ensure that the hard limit is not reached before we've
tried to recover by sending to a different remote port.

As a data storage target, the AoE target is unambiguously identified by
its {major, minor} AoE address tuple, and an AoE target can have multiple
MAC addresses.  However, note that "target" in the driver code and
comments means a {major, minor, MAC address} tuple, as in "somewhere to
send packets".
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d54d35ac

aoe: use packets that work with the smallest-MTU local interface · 3f0f0133

由 Ed Cashin 提交于 10月 04, 2012

Users with several network interfaces dedicated to AoE generally do not
configure them to support different-sized AoE data payloads on purpose.

For a given AoE target, there will be a set of local network interfaces
that can reach it. Using only the payload that will fit in the
smallest-sized MTU of all those local interfaces greatly simplifies the
driver, especially in failure scenarios.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3f0f0133

aoe: use a kernel thread for transmissions · eb086ec5

由 Ed Cashin 提交于 10月 04, 2012

The dev_queue_xmit function needs to have interrupts enabled, so the most
simple way to get the locking right but still fulfill that requirement is
to use a process that can call dev_queue_xmit serially over queued
transmissions.
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eb086ec5

aoe: become I/O request queue handler for increased user control · 69cf2d85

由 Ed Cashin 提交于 10月 04, 2012

To allow users to choose an elevator algorithm for their particular
workloads, change from a make_request-style driver to an
I/O-request-queue-handler-style driver.

We have to do a couple of things that might be surprising.  We manipulate
the page _count directly on the assumption that we still have no guarantee
that users of the block layer are prohibited from submitting bios
containing pages with zero reference counts.[1] If such a prohibition now
exists, I can get rid of the _count manipulation.

Just as before this patch, we still keep track of the sk_buffs that the
network layer still hasn't finished yet and cap the resources we use with
a "pool" of skbs.[2]

Now that the block layer maintains the disk stats, the aoe driver's
diskstats function can go away.

1. https://lkml.org/lkml/2007/3/1/374
2. https://lkml.org/lkml/2007/7/6/241Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

69cf2d85

aoe: kernel thread handles I/O completions for simple locking · 896831f5

由 Ed Cashin 提交于 10月 04, 2012

Make the frames the aoe driver uses to track the relationship between bios
and packets more flexible and detached, so that they can be passed to an
"aoe_ktio" thread for completion of I/O.

The frames are handled much like skbs, with a capped amount of
preallocation so that real-world use cases are likely to run smoothly and
degenerate gracefully even under memory pressure.

Decoupling I/O completion from the receive path and serializing it in a
process makes it easier to think about the correctness of the locking in
the driver, especially in the case of a remote MAC address becoming
unusable.

[dan.carpenter@oracle.com: cleanup an allocation a bit]
Signed-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

896831f5

aoe: for performance support larger packet payloads · 3d5b0605

由 Ed Cashin 提交于 10月 04, 2012

tAdd adds the ability to work with large packets composed of a number of
segments, using the scatter gather feature of the block layer (biovecs)
and the network layer (skb frag array). The motivation is the performance
gained by using a packet data payload greater than a page size and by
using the network card's scatter gather feature.

Users of the out-of-tree aoe driver already had these changes, but since
early 2011, they have complained of increased memory utilization and
higher CPU utilization during heavy writes.[1] The commit below appears
related, as it disables scatter gather on non-IP protocols inside the
harmonize_features function, even when the NIC supports sg.

commit f01a5236
Author: Jesse Gross <jesse@nicira.com>
Date: Sun Jan 9 06:23:31 2011 +0000

net offloading: Generalize netif_get_vlan_features().

With that regression in place, transmits always linearize sg AoE packets,
but in-kernel users did not have this patch. Before 2.6.38, though, these
changes were working to allow sg to increase performance.

1. http://www.spinics.net/lists/linux-mm/msg15184.htmlSigned-off-by: NEd Cashin <ecashin@coraid.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3d5b0605

nbd: handle discard requests · a336d298

由 Paul Clements 提交于 10月 04, 2012

Add discard support to nbd. If the nbd-server supports discard, it will
send NBD_FLAG_SEND_TRIM to the client. The client will then set the flag
in the kernel via NBD_SET_FLAGS, which tells the kernel to enable discards
for the device (QUEUE_FLAG_DISCARD).

If discard support is enabled, then when the nbd client system receives a
discard request, this will be passed along to the nbd-server. When the
discard request is received by the nbd-server, it will perform:

fallocate(.. FALLOC_FL_PUNCH_HOLE ..)

To punch a hole in the backend storage, which is no longer needed.
Signed-off-by: NPaul Clements <paul.clements@steeleye.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a336d298

nbd: add set flags ioctl · 2f012508

由 Paul Clements 提交于 10月 04, 2012

Add a set-flags ioctl, allowing various option flags to be set on an nbd
device. This allows the nbd-client to set the device flags (to enable
read-only mode, or enable discard support, etc.).

Flags are typically specified by the nbd-server. During the negotiation
phase of the nbd connection, the server sends its flags to the client.
The client then uses NBD_SET_FLAGS to inform the kernel of the options.

Also included is a one-line fix to debug output for the set-timeout ioctl.
Signed-off-by: NPaul Clements <paul.clements@steeleye.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2f012508

02 10月, 2012 13 次提交

rbd: BUG on invalid layout · 6cae3717

由 Sage Weil 提交于 9月 24, 2012

This shouldn't actually be possible because the layout struct is
constructed from the RBD header and validated then.

[elder@inktank.com: converted BUG() call to equivalent rbd_assert()]
Signed-off-by: NSage Weil <sage@inktank.com>
Reviewed-by: NAlex Elder <elder@inktank.com>

6cae3717

rbd: update remaining header fields for v2 · 6e14b1a6

由 Alex Elder 提交于 7月 03, 2012

There are three fields that are not yet updated for format 2 rbd
image headers:  the version of the header object; the encryption
type; and the compression type.  There is no interface defined for
fetching the latter two, so just initialize them explicitly to 0 for
now.

Change rbd_dev_v2_snap_context() so the caller can be supplied the
version for the header object.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

6e14b1a6

rbd: get snapshot name for a v2 image · b8b1e2db

由 Alex Elder 提交于 7月 03, 2012

Define rbd_dev_v2_snap_name() to fetch the name for a particular
snapshot in a format 2 rbd image.

Define rbd_dev_v2_snap_info() to to be a wrapper for getting the
name, size, and features for a particular snapshot, using an
interface that matches the equivalent function for version 1 images.

Define rbd_dev_snap_info() wrapper function and use it to call the
appropriate function for getting the snapshot name, size, and
features, dependent on the rbd image format.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

b8b1e2db

rbd: get the snapshot context for a v2 image · 35d489f9

由 Alex Elder 提交于 7月 03, 2012

Fetch the snapshot context for an rbd format 2 image by calling
the "get_snapcontext" method on its header object.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

35d489f9

rbd: get image features for a v2 image · b1b5402a

由 Alex Elder 提交于 7月 03, 2012

The features values for an rbd format 2 image are fetched from the
server using a "get_features" method.  The same method is used for
getting the features for a snapshot, so structure this addition with
a generic helper routine that can get this information for either.

The server will provide two 64-bit feature masks, one representing
the features potentially in use for this image (or its snapshot),
and one representing features that must be supported by the client
in order to work with the image.

For the time being, neither of these is really used so we keep
things simple and just record the first feature vector.  Once we
start using these feature masks, what we record and what we expose
to the user will most likely change.
Signed-off-by: NAlex Elder <elder@inktank.com>

b1b5402a

rbd: get the object prefix for a v2 rbd image · 1e130199

由 Alex Elder 提交于 7月 03, 2012

The object prefix of an rbd format 2 image is fetched from the
server using a "get_object_prefix" method.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

1e130199

rbd: add code to get the size of a v2 rbd image · 9d475de5

由 Alex Elder 提交于 7月 03, 2012

The size of an rbd format 2 image is fetched from the server using a
"get_size" method.  The same method is used for getting the size of
a snapshot, so structure this addition with a generic helper routine
that we can get this information for either.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

9d475de5

rbd: lay out header probe infrastructure · a30b71b9

由 Alex Elder 提交于 7月 10, 2012

This defines a new function rbd_dev_probe() as a top-level
function for populating detailed information about an rbd device.

It first checks for the existence of a format 2 rbd image id object.
If it exists, the image is assumed to be a format 2 rbd image, and
another function rbd_dev_v2() is called to finish populating
header data for that image.  If it does not exist, it is assumed to
be an old (format 1) rbd image, and calls a similar function
rbd_dev_v1() to populate its header information.

A new field, rbd_dev->format, is defined to record which version
of the rbd image format the device represents.  For a valid mapped
rbd device it will have one of two values, 1 or 2.

So far, the format 2 images are not really supported; this is
laying out the infrastructure for fleshing out that support.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

a30b71b9

rbd: encapsulate code that gets snapshot info · cd892126

由 Alex Elder 提交于 7月 03, 2012

Create a function that encapsulates looking up the name, size and
features related to a given snapshot, which is indicated by its
index in an rbd device's snapshot context array of snapshot ids.

This interface will be used to hide differences between the format 1
and format 2 images.

At the moment this (looking up the name anyway) is slightly less
efficient than what's done currently, but we may be able to optimize
this a bit later on by cacheing the last lookup if it proves to be a
problem.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

cd892126

rbd: add an rbd features field · 34b13184

由 Alex Elder 提交于 7月 13, 2012

Record the features values for each rbd image and each of its
snapshots.  This is really something that only becomes meaningful
for version 2 images, so this is just putting in place code
that will form common infrastructure.

It may be useful to expand the sysfs entries--and therefore the
information we maintain--for the image and for each snapshot.
But I'm going to hold off doing that until we start making
active use of the feature bits.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

34b13184

rbd: don't use index in __rbd_add_snap_dev() · c8d18425

由 Alex Elder 提交于 7月 10, 2012

Pass the snapshot id and snapshot size rather than an index
to __rbd_add_snap_dev() to specify values for a new snapshot.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

c8d18425

rbd: kill create_snap sysfs entry · 02cdb02c

由 Alex Elder 提交于 8月 10, 2012

Josh proposed the following change, and I don't think I could
explain it any better than he did:

    From: Josh Durgin <josh.durgin@inktank.com>
    Date: Tue, 24 Jul 2012 14:22:11 -0700
    To: ceph-devel <ceph-devel@vger.kernel.org>
    Message-ID: <500F1203.9050605@inktank.com>

    Right now the kernel still has one piece of rbd management
    duplicated from the rbd command line tool: snapshot creation.
    There's nothing special about snapshot creation that makes it
    advantageous to do from the kernel, so I'd like to remove the
    create_snap sysfs interface.  That is,
	/sys/bus/rbd/devices/<id>/create_snap
    would be removed.

    Does anyone rely on the sysfs interface for creating rbd
    snapshots?  If so, how hard would it be to replace with:

	rbd snap create pool/image@snap

    Is there any benefit to the sysfs interface that I'm missing?

    Josh

This patch implements this proposal, removing the code that
implements the "snap_create" sysfs interface for rbd images.
As a result, quite a lot of other supporting code goes away.
Suggested-by: NJosh Durgin <josh.durgin@inktank.com>
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

02cdb02c

rbd: define rbd_dev_image_id() · 589d30e0

由 Alex Elder 提交于 7月 10, 2012

New format 2 rbd images are permanently identified by a unique image
id.  Each rbd image also has a name, but the name can be changed.
A format 2 rbd image will have an object--whose name is based on the
image name--which maps an image's name to its image id.

Create a new function rbd_dev_image_id() that checks for the
existence of the image id object, and if it's found, records the
image id in the rbd_device structure.

Create a new rbd device attribute (/sys/bus/rbd/<num>/image_id) that
makes this information available.
Signed-off-by: NAlex Elder <elder@inktank.com>
Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>

589d30e0

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功