Commit 3a564bb3, authored by Linus Torvalds

Merge tag 'for-4.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Add the ability to use select or poll on /dev/mapper/control to wait
   for events from multiple DM devices.

 - Convert DM's printk macros over to using pr_<level> macros.

 - Add a big-endian variant of plain64 IV to dm-crypt.

 - Add support for zoned (aka SMR) devices to DM core. DM kcopyd was
   also improved to provide a sequential write feature needed by zoned
   devices.

 - Introduce DM zoned target that provides support for host-managed
   zoned devices. The resulting dm-zoned device acts as a drive-managed
   interface to the underlying host-managed device.

 - A DM raid fix to avoid using BUG() for error handling.

* tag 'for-4.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm zoned: fix overflow when converting zone ID to sectors
  dm raid: stop using BUG() in __rdev_sectors()
  dm zoned: drive-managed zoned block device target
  dm kcopyd: add sequential write feature
  dm linear: add support for zoned block devices
  dm flakey: add support for zoned block devices
  dm: introduce dm_remap_zone_report()
  dm: fix REQ_OP_ZONE_REPORT bio handling
  dm: fix REQ_OP_ZONE_RESET bio handling
  dm table: add zoned block devices validation
  dm: convert DM printk macros to pr_<level> macros
  dm crypt: add big-endian variant of plain64 IV
  dm bio prison: use rb_entry() rather than container_of()
  dm ioctl: report event number in DM_LIST_DEVICES
  dm ioctl: add a new DM_DEV_ARM_POLL ioctl
  dm: add basic support for using the select or poll function
dm-zoned
========
The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.
For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):
http://www.t10.org/drafts.htm#ZBC_Family
and (for ATA devices):
http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performing reclaim operations.
dm-zoned target devices are formatted and checked using the dmzadm
utility available at:
https://github.com/hgst/dm-zoned-tools
Algorithm
=========
dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.
The zones of the device are separated into 2 types:
1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as useable capacity to the user.
2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may be used also for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.
dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).
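The effect of the 4096-byte logical block size on metadata volume can be
illustrated with a short calculation. This is only a sketch: it assumes
one validity bit per logical block and reads the 256 MB zone size quoted
earlier as 256 MiB. The helper name zone_bitmap_bytes is hypothetical
and does not come from the dm-zoned sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of validity bitmap needed to cover one
 * zone, assuming one bit per logical block (an assumption, not a
 * statement about the actual on-disk format).
 */
static uint64_t zone_bitmap_bytes(uint64_t zone_bytes, uint32_t block_bytes)
{
    assert(block_bytes != 0 && zone_bytes % block_bytes == 0);

    uint64_t blocks = zone_bytes / block_bytes;

    return (blocks + 7) / 8;    /* round up to whole bytes */
}
```

Under these assumptions a 256 MiB zone needs 8192 bytes of bitmap with
4096-byte blocks, against 65536 bytes with 512-byte blocks.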
The on-disk metadata format is as follows:
1) The first block of the first conventional zone found contains the
super block which describes the on disk amount and position of metadata
blocks.
2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block device. The
mapping table is indexed by chunk number and each mapping entry
indicates the zone number of the device storing the chunk of data. Each
mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.
3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.
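The chunk-indexed mapping table described in item 2 can be sketched as
a simple array lookup. The structure layout, the MAP_UNMAPPED marker and
the function name below are illustrative assumptions, not the on-disk
format used by dm-zoned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_UNMAPPED UINT32_MAX    /* hypothetical "no zone assigned" marker */

/* Hypothetical in-memory form of one mapping entry. */
struct chunk_map {
    uint32_t dzone_id;    /* zone storing the chunk data */
    uint32_t bzone_id;    /* conventional buffer zone, or MAP_UNMAPPED */
};

/*
 * Return the mapping entry of the chunk containing a logical block,
 * or NULL if the block lies beyond the end of the table.
 */
static const struct chunk_map *lookup_chunk(const struct chunk_map *table,
                                            size_t nr_chunks,
                                            uint64_t block,
                                            uint64_t blocks_per_chunk)
{
    assert(blocks_per_chunk != 0);

    uint64_t chunk = block / blocks_per_chunk;

    return chunk < nr_chunks ? &table[chunk] : NULL;
}
```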
For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.
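The write routing rule above can be condensed into a small decision
function. This is a sketch of the rule as stated in this document; the
types and names are hypothetical and the real target tracks considerably
more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum write_path { WRITE_DIRECT, WRITE_BUFFERED };

/* Hypothetical view of the zone currently mapping a chunk. */
struct data_zone {
    bool sequential;      /* false for a conventional zone */
    uint64_t wp_block;    /* write pointer offset within the zone, in blocks */
};

/*
 * Conventional zones accept any write directly. Sequential zones accept
 * a direct write only when it lands exactly on the write pointer;
 * everything else is routed through a buffer zone.
 */
static enum write_path choose_write_path(const struct data_zone *z,
                                         uint64_t chunk_block)
{
    assert(z);

    if (!z->sequential)
        return WRITE_DIRECT;

    return chunk_block == z->wp_block ? WRITE_DIRECT : WRITE_BUFFERED;
}
```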
Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.
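A minimal sketch of the read-side lookup follows, assuming one validity
bit per block and one bitmap per zone. The names and the bit ordering
are illustrative assumptions. A NULL bitmap stands for "no zone of that
kind is assigned to the chunk".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum read_source { READ_ZERO, READ_DATA_ZONE, READ_BUFFER_ZONE };

static bool bit_is_set(const uint8_t *bitmap, uint64_t bit)
{
    return (bitmap[bit / 8] & (1u << (bit % 8))) != 0;
}

/*
 * Decide where a block of a chunk is read from. Unmapped chunks and
 * invalid blocks read back as zeroes.
 */
static enum read_source choose_read_source(const uint8_t *dzone_valid,
                                           const uint8_t *bzone_valid,
                                           uint64_t chunk_block)
{
    bool in_buf = bzone_valid && bit_is_set(bzone_valid, chunk_block);
    bool in_data = dzone_valid && bit_is_set(dzone_valid, chunk_block);

    /* A block is valid in at most one of the two zones. */
    assert(!(in_buf && in_data));

    if (in_buf)
        return READ_BUFFER_ZONE;
    if (in_data)
        return READ_DATA_ZONE;
    return READ_ZERO;
}
```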
After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone freed for reuse.
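The victim selection step of reclaim can be sketched as a scan for the
least recently used conventional zone. The structure and the timestamp
field are hypothetical; the actual bookkeeping used by the target is not
described in this document.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-zone state for conventional zones. */
struct buf_zone {
    bool in_use;             /* maps a chunk or buffers a sequential zone */
    uint64_t last_access;    /* monotonic timestamp of the last access */
};

/*
 * Return the index of the least recently used in-use zone, or -1 if
 * no zone is in use and there is nothing to reclaim.
 */
static int pick_reclaim_zone(const struct buf_zone *zones, size_t n)
{
    int victim = -1;

    assert(n <= INT_MAX);

    for (size_t i = 0; i < n; i++) {
        if (!zones[i].in_use)
            continue;
        if (victim < 0 || zones[i].last_access < zones[victim].last_access)
            victim = (int)i;
    }
    return victim;
}
```

Once a victim is chosen, its valid blocks would be copied to a free
sequential zone and the chunk mapping updated, as described above.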
Metadata Protection
===================
To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place metadata
block updates can be done in the primary metadata set. This ensures that
one of the sets is always consistent (all modifications committed or none
at all). Flush operations are used as a commit point. Upon reception of
a flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while
metadata flush is being executed.
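One way the generation counter could be used after a crash is to pick
the metadata set to trust. The sketch below is an assumption about such
a recovery step, which this document does not describe; the valid flag
stands in for whatever integrity check a super block carries, and all
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of a super block read from disk. */
struct sb_info {
    bool valid;      /* passed its integrity check */
    uint64_t gen;    /* generation counter */
};

/*
 * Return 0 to use the primary set, 1 to use the secondary set, or -1
 * if neither set is usable. On equal generations the primary set wins.
 */
static int pick_metadata_set(const struct sb_info *primary,
                             const struct sb_info *secondary)
{
    assert(primary && secondary);

    if (primary->valid && secondary->valid)
        return secondary->gen > primary->gen ? 1 : 0;
    if (primary->valid)
        return 0;
    if (secondary->valid)
        return 1;
    return -1;
}
```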
Usage
=====
A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.
Ex:
dmzadm --format /dev/sdxx
For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex:
echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
......@@ -521,6 +521,23 @@ config DM_INTEGRITY
To compile this code as a module, choose M here: the module will
be called dm-integrity.
config DM_ZONED
tristate "Drive-managed zoned block device target support"
depends on BLK_DEV_DM
depends on BLK_DEV_ZONED
---help---
This device-mapper target takes a host-managed or host-aware zoned
block device and exposes most of its capacity as a regular block
device (drive-managed zoned block device) without any write
constraints. This is mainly intended for use with file systems that
do not natively support zoned block devices but still want to
benefit from the increased capacity offered by SMR disks. Other uses
by applications using raw block devices (for example object stores)
are also possible.
To compile this code as a module, choose M here: the module will
be called dm-zoned.
If unsure, say N.
endif # MD
......@@ -20,6 +20,7 @@ dm-era-y += dm-era-target.o
dm-verity-y += dm-verity-target.o
md-mod-y += md.o bitmap.o
raid456-y += raid5.o raid5-cache.o raid5-ppl.o
dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
# Note: link order is important. All raid personalities
# and must come before md.o, as they each initialise
......@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
obj-$(CONFIG_DM_ERA) += dm-era.o
obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o
obj-$(CONFIG_DM_INTEGRITY) += dm-integrity.o
obj-$(CONFIG_DM_ZONED) += dm-zoned.o
ifeq ($(CONFIG_DM_UEVENT),y)
dm-mod-objs += dm-uevent.o
......
......@@ -116,7 +116,7 @@ static int __bio_detain(struct dm_bio_prison *prison,
while (*new) {
struct dm_bio_prison_cell *cell =
container_of(*new, struct dm_bio_prison_cell, node);
rb_entry(*new, struct dm_bio_prison_cell, node);
r = cmp_keys(key, &cell->key);
......
......@@ -120,7 +120,7 @@ static bool __find_or_insert(struct dm_bio_prison_v2 *prison,
while (*new) {
struct dm_bio_prison_cell_v2 *cell =
container_of(*new, struct dm_bio_prison_cell_v2, node);
rb_entry(*new, struct dm_bio_prison_cell_v2, node);
r = cmp_keys(key, &cell->key);
......
......@@ -147,4 +147,7 @@ static inline bool dm_message_test_buffer_overflow(char *result, unsigned maxlen
return !maxlen || strlen(result) + 1 >= maxlen;
}
extern atomic_t dm_global_event_nr;
extern wait_queue_head_t dm_global_eventq;
#endif
......@@ -246,6 +246,9 @@ static struct crypto_aead *any_tfm_aead(struct crypt_config *cc)
* plain64: the initial vector is the 64-bit little-endian version of the sector
* number, padded with zeros if necessary.
*
* plain64be: the initial vector is the 64-bit big-endian version of the sector
* number, padded with zeros if necessary.
*
* essiv: "encrypted sector|salt initial vector", the sector number is
* encrypted with the bulk cipher using a salt as key. The salt
* should be derived from the bulk cipher's key via hashing.
......@@ -302,6 +305,16 @@ static int crypt_iv_plain64_gen(struct crypt_config *cc, u8 *iv,
return 0;
}
static int crypt_iv_plain64be_gen(struct crypt_config *cc, u8 *iv,
struct dm_crypt_request *dmreq)
{
memset(iv, 0, cc->iv_size);
/* iv_size is at least of size u64; usually it is 16 bytes */
*(__be64 *)&iv[cc->iv_size - sizeof(u64)] = cpu_to_be64(dmreq->iv_sector);
return 0;
}
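The byte layout produced by crypt_iv_plain64be_gen() can be reproduced in
user space. The following is a portable sketch, not kernel code: the
function name iv_plain64be is hypothetical, and explicit shifts replace
cpu_to_be64() so that the result does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Zero the IV, then store the sector number as a 64-bit big-endian
 * value in its last 8 bytes.
 */
static void iv_plain64be(uint8_t *iv, size_t iv_size, uint64_t sector)
{
    assert(iv_size >= sizeof(uint64_t));

    memset(iv, 0, iv_size);
    for (size_t i = 0; i < sizeof(uint64_t); i++)
        iv[iv_size - 1 - i] = (uint8_t)(sector >> (8 * i));
}
```

For a 16-byte IV and sector 0x0102030405060708, bytes 0 to 7 are zero
and bytes 8 to 15 hold 01 02 03 04 05 06 07 08.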
/* Initialise ESSIV - compute salt but no local memory allocations */
static int crypt_iv_essiv_init(struct crypt_config *cc)
{
......@@ -835,6 +848,10 @@ static const struct crypt_iv_operations crypt_iv_plain64_ops = {
.generator = crypt_iv_plain64_gen
};
static const struct crypt_iv_operations crypt_iv_plain64be_ops = {
.generator = crypt_iv_plain64be_gen
};
static const struct crypt_iv_operations crypt_iv_essiv_ops = {
.ctr = crypt_iv_essiv_ctr,
.dtr = crypt_iv_essiv_dtr,
......@@ -2208,6 +2225,8 @@ static int crypt_ctr_ivmode(struct dm_target *ti, const char *ivmode)
cc->iv_gen_ops = &crypt_iv_plain_ops;
else if (strcmp(ivmode, "plain64") == 0)
cc->iv_gen_ops = &crypt_iv_plain64_ops;
else if (strcmp(ivmode, "plain64be") == 0)
cc->iv_gen_ops = &crypt_iv_plain64be_ops;
else if (strcmp(ivmode, "essiv") == 0)
cc->iv_gen_ops = &crypt_iv_essiv_ops;
else if (strcmp(ivmode, "benbi") == 0)
......@@ -2987,7 +3006,7 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
static struct target_type crypt_target = {
.name = "crypt",
.version = {1, 17, 0},
.version = {1, 18, 0},
.module = THIS_MODULE,
.ctr = crypt_ctr,
.dtr = crypt_dtr,
......
......@@ -275,7 +275,7 @@ static void flakey_map_bio(struct dm_target *ti, struct bio *bio)
struct flakey_c *fc = ti->private;
bio->bi_bdev = fc->dev->bdev;
if (bio_sectors(bio))
if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)
bio->bi_iter.bi_sector =
flakey_map_sector(ti, bio->bi_iter.bi_sector);
}
......@@ -306,6 +306,14 @@ static int flakey_map(struct dm_target *ti, struct bio *bio)
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data));
pb->bio_submitted = false;
/* Do not fail reset zone */
if (bio_op(bio) == REQ_OP_ZONE_RESET)
goto map_bio;
/* We need to remap reported zones, so remember the BIO iter */
if (bio_op(bio) == REQ_OP_ZONE_REPORT)
goto map_bio;
/* Are we alive ? */
elapsed = (jiffies - fc->start_time) / HZ;
if (elapsed % (fc->up_interval + fc->down_interval) >= fc->up_interval) {
......@@ -359,11 +367,19 @@ static int flakey_map(struct dm_target *ti, struct bio *bio)
}
static int flakey_end_io(struct dm_target *ti, struct bio *bio,
blk_status_t *error)
blk_status_t *error)
{
struct flakey_c *fc = ti->private;
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct per_bio_data));
if (bio_op(bio) == REQ_OP_ZONE_RESET)
return DM_ENDIO_DONE;
if (bio_op(bio) == REQ_OP_ZONE_REPORT) {
dm_remap_zone_report(ti, bio, fc->start);
return DM_ENDIO_DONE;
}
if (!*error && pb->bio_submitted && (bio_data_dir(bio) == READ)) {
if (fc->corrupt_bio_byte && (fc->corrupt_bio_rw == READ) &&
all_corrupt_bio_flags_match(bio, fc)) {
......@@ -446,7 +462,8 @@ static int flakey_iterate_devices(struct dm_target *ti, iterate_devices_callout_
static struct target_type flakey_target = {
.name = "flakey",
.version = {1, 4, 0},
.version = {1, 5, 0},
.features = DM_TARGET_ZONED_HM,
.module = THIS_MODULE,
.ctr = flakey_ctr,
.dtr = flakey_dtr,
......
......@@ -23,6 +23,14 @@
#define DM_MSG_PREFIX "ioctl"
#define DM_DRIVER_EMAIL "dm-devel@redhat.com"
struct dm_file {
/*
* poll will wait until the global event number is greater than
* this value.
*/
volatile unsigned global_event_nr;
};
/*-----------------------------------------------------------------
* The ioctl interface needs to be able to look up devices by
* name or uuid.
......@@ -456,9 +464,9 @@ void dm_deferred_remove(void)
* All the ioctl commands get dispatched to functions with this
* prototype.
*/
typedef int (*ioctl_fn)(struct dm_ioctl *param, size_t param_size);
typedef int (*ioctl_fn)(struct file *filp, struct dm_ioctl *param, size_t param_size);
static int remove_all(struct dm_ioctl *param, size_t param_size)
static int remove_all(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
dm_hash_remove_all(true, !!(param->flags & DM_DEFERRED_REMOVE), false);
param->data_size = 0;
......@@ -491,13 +499,14 @@ static void *get_result_buffer(struct dm_ioctl *param, size_t param_size,
return ((void *) param) + param->data_start;
}
static int list_devices(struct dm_ioctl *param, size_t param_size)
static int list_devices(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
unsigned int i;
struct hash_cell *hc;
size_t len, needed = 0;
struct gendisk *disk;
struct dm_name_list *nl, *old_nl = NULL;
uint32_t *event_nr;
down_write(&_hash_lock);
......@@ -510,6 +519,7 @@ static int list_devices(struct dm_ioctl *param, size_t param_size)
needed += sizeof(struct dm_name_list);
needed += strlen(hc->name) + 1;
needed += ALIGN_MASK;
needed += (sizeof(uint32_t) + ALIGN_MASK) & ~ALIGN_MASK;
}
}
......@@ -539,7 +549,9 @@ static int list_devices(struct dm_ioctl *param, size_t param_size)
strcpy(nl->name, hc->name);
old_nl = nl;
nl = align_ptr(((void *) ++nl) + strlen(hc->name) + 1);
event_nr = align_ptr(((void *) (nl + 1)) + strlen(hc->name) + 1);
*event_nr = dm_get_event_nr(hc->md);
nl = align_ptr(event_nr + 1);
}
}
......@@ -582,7 +594,7 @@ static void list_version_get_info(struct target_type *tt, void *param)
info->vers = align_ptr(((void *) ++info->vers) + strlen(tt->name) + 1);
}
static int list_versions(struct dm_ioctl *param, size_t param_size)
static int list_versions(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
size_t len, needed = 0;
struct dm_target_versions *vers;
......@@ -724,7 +736,7 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)
}
}
static int dev_create(struct dm_ioctl *param, size_t param_size)
static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
int r, m = DM_ANY_MINOR;
struct mapped_device *md;
......@@ -816,7 +828,7 @@ static struct mapped_device *find_device(struct dm_ioctl *param)
return md;
}
static int dev_remove(struct dm_ioctl *param, size_t param_size)
static int dev_remove(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
struct hash_cell *hc;
struct mapped_device *md;
......@@ -881,7 +893,7 @@ static int invalid_str(char *str, void *end)
return -EINVAL;
}
static int dev_rename(struct dm_ioctl *param, size_t param_size)
static int dev_rename(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
int r;
char *new_data = (char *) param + param->data_start;
......@@ -911,7 +923,7 @@ static int dev_rename(struct dm_ioctl *param, size_t param_size)
return 0;
}
static int dev_set_geometry(struct dm_ioctl *param, size_t param_size)
static int dev_set_geometry(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
int r = -EINVAL, x;
struct mapped_device *md;
......@@ -1060,7 +1072,7 @@ static int do_resume(struct dm_ioctl *param)
* Set or unset the suspension state of a device.
* If the device already is in the requested state we just return its status.
*/
static int dev_suspend(struct dm_ioctl *param, size_t param_size)
static int dev_suspend(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
if (param->flags & DM_SUSPEND_FLAG)
return do_suspend(param);
......@@ -1072,7 +1084,7 @@ static int dev_suspend(struct dm_ioctl *param, size_t param_size)
* Copies device info back to user space, used by
* the create and info ioctls.
*/
static int dev_status(struct dm_ioctl *param, size_t param_size)
static int dev_status(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
struct mapped_device *md;
......@@ -1163,7 +1175,7 @@ static void retrieve_status(struct dm_table *table,
/*
* Wait for a device to report an event
*/
static int dev_wait(struct dm_ioctl *param, size_t param_size)
static int dev_wait(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
int r = 0;
struct mapped_device *md;
......@@ -1200,6 +1212,19 @@ static int dev_wait(struct dm_ioctl *param, size_t param_size)
return r;
}
/*
* Remember the global event number and make it possible to poll
* for further events.
*/
static int dev_arm_poll(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
struct dm_file *priv = filp->private_data;
priv->global_event_nr = atomic_read(&dm_global_event_nr);
return 0;
}
static inline fmode_t get_mode(struct dm_ioctl *param)
{
fmode_t mode = FMODE_READ | FMODE_WRITE;
......@@ -1269,7 +1294,7 @@ static bool is_valid_type(enum dm_queue_mode cur, enum dm_queue_mode new)
return false;
}
static int table_load(struct dm_ioctl *param, size_t param_size)
static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
int r;
struct hash_cell *hc;
......@@ -1356,7 +1381,7 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
return r;
}
static int table_clear(struct dm_ioctl *param, size_t param_size)
static int table_clear(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
struct hash_cell *hc;
struct mapped_device *md;
......@@ -1430,7 +1455,7 @@ static void retrieve_deps(struct dm_table *table,
param->data_size = param->data_start + needed;
}
static int table_deps(struct dm_ioctl *param, size_t param_size)
static int table_deps(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
struct mapped_device *md;
struct dm_table *table;
......@@ -1456,7 +1481,7 @@ static int table_deps(struct dm_ioctl *param, size_t param_size)
* Return the status of a device as a text string for each
* target.
*/
static int table_status(struct dm_ioctl *param, size_t param_size)
static int table_status(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
struct mapped_device *md;
struct dm_table *table;
......@@ -1511,7 +1536,7 @@ static int message_for_md(struct mapped_device *md, unsigned argc, char **argv,
/*
* Pass a message to the target that's at the supplied device offset.
*/
static int target_message(struct dm_ioctl *param, size_t param_size)
static int target_message(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
int r, argc;
char **argv;
......@@ -1628,7 +1653,8 @@ static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags)
{DM_LIST_VERSIONS_CMD, 0, list_versions},
{DM_TARGET_MSG_CMD, 0, target_message},
{DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry}
{DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry},
{DM_DEV_ARM_POLL, IOCTL_FLAGS_NO_PARAMS, dev_arm_poll},
};
if (unlikely(cmd >= ARRAY_SIZE(_ioctls)))
......@@ -1783,7 +1809,7 @@ static int validate_params(uint cmd, struct dm_ioctl *param)
return 0;
}
static int ctl_ioctl(uint command, struct dm_ioctl __user *user)
static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *user)
{
int r = 0;
int ioctl_flags;
......@@ -1837,7 +1863,7 @@ static int ctl_ioctl(uint command, struct dm_ioctl __user *user)
goto out;
param->data_size = offsetof(struct dm_ioctl, data);
r = fn(param, input_param_size);
r = fn(file, param, input_param_size);
if (unlikely(param->flags & DM_BUFFER_FULL_FLAG) &&
unlikely(ioctl_flags & IOCTL_FLAGS_NO_PARAMS))
......@@ -1856,7 +1882,7 @@ static int ctl_ioctl(uint command, struct dm_ioctl __user *user)
static long dm_ctl_ioctl(struct file *file, uint command, ulong u)
{
return (long)ctl_ioctl(command, (struct dm_ioctl __user *)u);
return (long)ctl_ioctl(file, command, (struct dm_ioctl __user *)u);
}
#ifdef CONFIG_COMPAT
......@@ -1868,8 +1894,47 @@ static long dm_compat_ctl_ioctl(struct file *file, uint command, ulong u)
#define dm_compat_ctl_ioctl NULL
#endif
static int dm_open(struct inode *inode, struct file *filp)
{
int r;
struct dm_file *priv;
r = nonseekable_open(inode, filp);
if (unlikely(r))
return r;
priv = filp->private_data = kmalloc(sizeof(struct dm_file), GFP_KERNEL);
if (!priv)
return -ENOMEM;
priv->global_event_nr = atomic_read(&dm_global_event_nr);
return 0;
}
static int dm_release(struct inode *inode, struct file *filp)
{
kfree(filp->private_data);
return 0;
}
static unsigned dm_poll(struct file *filp, poll_table *wait)
{
struct dm_file *priv = filp->private_data;
unsigned mask = 0;
poll_wait(filp, &dm_global_eventq, wait);
if ((int)(atomic_read(&dm_global_event_nr) - priv->global_event_nr) > 0)
mask |= POLLIN;
return mask;
}
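The signed-difference comparison used in dm_poll() keeps working when
the global event counter wraps around. The following user-space sketch
isolates that idiom; the function name is hypothetical, and the cast
relies on the two's complement conversion provided by common compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the global counter has advanced past the armed value.
 * The subtraction is done in unsigned arithmetic, where wraparound is
 * well defined, and the difference is then interpreted as signed.
 */
static bool event_pending(uint32_t global_nr, uint32_t armed_nr)
{
    return (int32_t)(global_nr - armed_nr) > 0;
}
```

With an armed value of UINT32_MAX and a global counter that has wrapped
to 2, the difference is 3, so an event is still reported as pending.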
static const struct file_operations _ctl_fops = {
.open = nonseekable_open,
.open = dm_open,
.release = dm_release,
.poll = dm_poll,
.unlocked_ioctl = dm_ctl_ioctl,
.compat_ioctl = dm_compat_ctl_ioctl,
.owner = THIS_MODULE,
......
......@@ -356,6 +356,7 @@ struct kcopyd_job {
struct mutex lock;
atomic_t sub_jobs;
sector_t progress;
sector_t write_offset;
struct kcopyd_job *master_job;
};
......@@ -386,6 +387,31 @@ void dm_kcopyd_exit(void)
* Functions to push and pop a job onto the head of a given job
* list.
*/
static struct kcopyd_job *pop_io_job(struct list_head *jobs,
struct dm_kcopyd_client *kc)
{
struct kcopyd_job *job;
/*
* For I/O jobs, pop any read, any write without sequential write
* constraint and sequential writes that are at the right position.
*/
list_for_each_entry(job, jobs, list) {
if (job->rw == READ || !test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags)) {
list_del(&job->list);
return job;
}
if (job->write_offset == job->master_job->write_offset) {
job->master_job->write_offset += job->source.count;
list_del(&job->list);
return job;
}
}
return NULL;
}
static struct kcopyd_job *pop(struct list_head *jobs,
struct dm_kcopyd_client *kc)
{
......@@ -395,8 +421,12 @@ static struct kcopyd_job *pop(struct list_head *jobs,
spin_lock_irqsave(&kc->job_lock, flags);
if (!list_empty(jobs)) {
job = list_entry(jobs->next, struct kcopyd_job, list);
list_del(&job->list);
if (jobs == &kc->io_jobs)
job = pop_io_job(jobs, kc);
else {
job = list_entry(jobs->next, struct kcopyd_job, list);
list_del(&job->list);
}
}
spin_unlock_irqrestore(&kc->job_lock, flags);
......@@ -506,6 +536,14 @@ static int run_io_job(struct kcopyd_job *job)
.client = job->kc->io_client,
};
/*
* If we need to write sequentially and some reads or writes failed,
* no point in continuing.
*/
if (test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags) &&
job->master_job->write_err)
return -EIO;
io_job_start(job->kc->throttle);
if (job->rw == READ)
......@@ -655,6 +693,7 @@ static void segment_complete(int read_err, unsigned long write_err,
int i;
*sub_job = *job;
sub_job->write_offset = progress;
sub_job->source.sector += progress;
sub_job->source.count = count;
......@@ -723,6 +762,27 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
job->num_dests = num_dests;
memcpy(&job->dests, dests, sizeof(*dests) * num_dests);
/*
* If one of the destination is a host-managed zoned block device,
* we need to write sequentially. If one of the destination is a
* host-aware device, then leave it to the caller to choose what to do.
*/
if (!test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags)) {
for (i = 0; i < job->num_dests; i++) {
if (bdev_zoned_model(dests[i].bdev) == BLK_ZONED_HM) {
set_bit(DM_KCOPYD_WRITE_SEQ, &job->flags);
break;
}
}
}
/*
* If we need to write sequentially, errors cannot be ignored.
*/
if (test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags) &&
test_bit(DM_KCOPYD_IGNORE_ERROR, &job->flags))
clear_bit(DM_KCOPYD_IGNORE_ERROR, &job->flags);
if (from) {
job->source = *from;
job->pages = NULL;
......@@ -746,6 +806,7 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
job->fn = fn;
job->context = context;
job->master_job = job;
job->write_offset = 0;
if (job->source.count <= SUB_JOB_SIZE)
dispatch_job(job);
......
......@@ -89,7 +89,7 @@ static void linear_map_bio(struct dm_target *ti, struct bio *bio)
struct linear_c *lc = ti->private;
bio->bi_bdev = lc->dev->bdev;
if (bio_sectors(bio))
if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)
bio->bi_iter.bi_sector =
linear_map_sector(ti, bio->bi_iter.bi_sector);
}
......@@ -101,6 +101,17 @@ static int linear_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_REMAPPED;
}
static int linear_end_io(struct dm_target *ti, struct bio *bio,
blk_status_t *error)
{
struct linear_c *lc = ti->private;
if (!*error && bio_op(bio) == REQ_OP_ZONE_REPORT)
dm_remap_zone_report(ti, bio, lc->start);
return DM_ENDIO_DONE;
}
static void linear_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
......@@ -161,12 +172,13 @@ static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
static struct target_type linear_target = {
.name = "linear",
.version = {1, 3, 0},
.features = DM_TARGET_PASSES_INTEGRITY,
.version = {1, 4, 0},
.features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
.module = THIS_MODULE,
.ctr = linear_ctr,
.dtr = linear_dtr,
.map = linear_map,
.end_io = linear_end_io,
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
......
......@@ -1571,7 +1571,7 @@ static sector_t __rdev_sectors(struct raid_set *rs)
return rdev->sectors;
}
BUG(); /* Constructor ensures we got some. */
return 0;
}
/* Calculate the sectors per device and per array used for @rs */
......@@ -2941,7 +2941,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
bool resize;
struct raid_type *rt;
unsigned int num_raid_params, num_raid_devs;
sector_t calculated_dev_sectors;
sector_t calculated_dev_sectors, rdev_sectors;
struct raid_set *rs = NULL;
const char *arg;
struct rs_layout rs_layout;
......@@ -3017,7 +3017,14 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
if (r)
goto bad;
resize = calculated_dev_sectors != __rdev_sectors(rs);
rdev_sectors = __rdev_sectors(rs);
if (!rdev_sectors) {
ti->error = "Invalid rdev size";
r = -EINVAL;
goto bad;
}
resize = calculated_dev_sectors != rdev_sectors;
INIT_WORK(&rs->md.event_work, do_table_event);
ti->private = rs;
......
......@@ -319,6 +319,39 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
return 1;
}
/*
* If the target is mapped to zoned block device(s), check
* that the zones are not partially mapped.
*/
if (bdev_zoned_model(bdev) != BLK_ZONED_NONE) {
unsigned int zone_sectors = bdev_zone_sectors(bdev);
if (start & (zone_sectors - 1)) {
DMWARN("%s: start=%llu not aligned to h/w zone size %u of %s",
dm_device_name(ti->table->md),
(unsigned long long)start,
zone_sectors, bdevname(bdev, b));
return 1;
}
/*
* Note: The last zone of a zoned block device may be smaller
* than other zones. So for a target mapping the end of a
* zoned block device with such a zone, len would not be zone
* aligned. We do not allow such last smaller zone to be part
* of the mapping here to ensure that mappings with multiple
* devices do not end up with a smaller zone in the middle of
* the sector range.
*/
if (len & (zone_sectors - 1)) {
DMWARN("%s: len=%llu not aligned to h/w zone size %u of %s",
dm_device_name(ti->table->md),
(unsigned long long)len,
zone_sectors, bdevname(bdev, b));
return 1;
}
}
if (logical_block_size_sectors <= 1)
return 0;
......@@ -456,6 +489,8 @@ static int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
q->limits.alignment_offset,
(unsigned long long) start << SECTOR_SHIFT);
limits->zoned = blk_queue_zoned_model(q);
return 0;
}
......@@ -1346,6 +1381,88 @@ bool dm_table_has_no_data_devices(struct dm_table *table)
return true;
}
static int device_is_zoned_model(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
struct request_queue *q = bdev_get_queue(dev->bdev);
enum blk_zoned_model *zoned_model = data;
return q && blk_queue_zoned_model(q) == *zoned_model;
}
static bool dm_table_supports_zoned_model(struct dm_table *t,
enum blk_zoned_model zoned_model)
{
struct dm_target *ti;
unsigned i;
for (i = 0; i < dm_table_get_num_targets(t); i++) {
ti = dm_table_get_target(t, i);
if (zoned_model == BLK_ZONED_HM &&
!dm_target_supports_zoned_hm(ti->type))
return false;
if (!ti->type->iterate_devices ||
!ti->type->iterate_devices(ti, device_is_zoned_model, &zoned_model))
return false;
}
return true;
}
static int device_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
struct request_queue *q = bdev_get_queue(dev->bdev);
unsigned int *zone_sectors = data;
return q && blk_queue_zone_sectors(q) == *zone_sectors;
}
static bool dm_table_matches_zone_sectors(struct dm_table *t,
unsigned int zone_sectors)
{
struct dm_target *ti;
unsigned i;
for (i = 0; i < dm_table_get_num_targets(t); i++) {
ti = dm_table_get_target(t, i);
if (!ti->type->iterate_devices ||
!ti->type->iterate_devices(ti, device_matches_zone_sectors, &zone_sectors))
return false;
}
return true;
}
static int validate_hardware_zoned_model(struct dm_table *table,
enum blk_zoned_model zoned_model,
unsigned int zone_sectors)
{
if (zoned_model == BLK_ZONED_NONE)
return 0;
if (!dm_table_supports_zoned_model(table, zoned_model)) {
DMERR("%s: zoned model is not consistent across all devices",
dm_device_name(table->md));
return -EINVAL;
}
/* Check zone size validity and compatibility */
if (!zone_sectors || !is_power_of_2(zone_sectors))
return -EINVAL;
if (!dm_table_matches_zone_sectors(table, zone_sectors)) {
DMERR("%s: zone sectors is not consistent across all devices",
dm_device_name(table->md));
return -EINVAL;
}
return 0;
}
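
The zone size checks above pair with the mask test `len & (zone_sectors - 1)` used in the alignment check earlier in this diff: that test only detects misalignment correctly when `zone_sectors` is a power of two. Below is a minimal userspace sketch of both checks. The names `is_pow2` and `zone_aligned` are hypothetical stand-ins and are not part of the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's is_power_of_2(). */
static bool is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Return true if len (in sectors) is a multiple of zone_sectors.
 * The mask trick is only valid for power-of-two zone sizes, which is
 * why the zone size is validated before it is trusted.
 */
static bool zone_aligned(uint64_t len, uint32_t zone_sectors)
{
	assert(is_pow2(zone_sectors));
	return (len & ((uint64_t)zone_sectors - 1)) == 0;
}
```

For example, with a 256 MiB zone (524288 sectors of 512 bytes), a mapping of 1048576 sectors passes the test while one of 524296 sectors does not.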
/*
* Establish the new table's queue_limits and validate them.
*/
......@@ -1355,6 +1472,8 @@ int dm_calculate_queue_limits(struct dm_table *table,
struct dm_target *ti;
struct queue_limits ti_limits;
unsigned i;
enum blk_zoned_model zoned_model = BLK_ZONED_NONE;
unsigned int zone_sectors = 0;
blk_set_stacking_limits(limits);
......@@ -1372,6 +1491,15 @@ int dm_calculate_queue_limits(struct dm_table *table,
ti->type->iterate_devices(ti, dm_set_device_limits,
&ti_limits);
if (zoned_model == BLK_ZONED_NONE && ti_limits.zoned != BLK_ZONED_NONE) {
/*
* After stacking all limits, validate all devices
* in table support this zoned model and zone sectors.
*/
zoned_model = ti_limits.zoned;
zone_sectors = ti_limits.chunk_sectors;
}
/* Set I/O hints portion of queue limits */
if (ti->type->io_hints)
ti->type->io_hints(ti, &ti_limits);
......@@ -1396,8 +1524,42 @@ int dm_calculate_queue_limits(struct dm_table *table,
dm_device_name(table->md),
(unsigned long long) ti->begin,
(unsigned long long) ti->len);
/*
* FIXME: this should likely be moved to blk_stack_limits(), would
* also eliminate limits->zoned stacking hack in dm_set_device_limits()
*/
if (limits->zoned == BLK_ZONED_NONE && ti_limits.zoned != BLK_ZONED_NONE) {
/*
* By default, the stacked limits zoned model is set to
* BLK_ZONED_NONE in blk_set_stacking_limits(). Update
* this model using the first target model reported
* that is not BLK_ZONED_NONE. This will be either the
* first target device zoned model or the model reported
* by the target .io_hints.
*/
limits->zoned = ti_limits.zoned;
}
}
/*
* Verify that the zoned model and zone sectors, as determined before
* any .io_hints override, are the same across all devices in the table.
* - this is especially relevant if .io_hints is emulating a drive-managed
* zoned model (aka BLK_ZONED_NONE) on host-managed zoned block devices.
* BUT...
*/
if (limits->zoned != BLK_ZONED_NONE) {
/*
* ...IF the above limits stacking determined a zoned model
* validate that all of the table's devices conform to it.
*/
zoned_model = limits->zoned;
zone_sectors = limits->chunk_sectors;
}
if (validate_hardware_zoned_model(table, zoned_model, zone_sectors))
return -EINVAL;
return validate_hardware_logical_block_alignment(table, limits);
}
......
/*
* Copyright (C) 2017 Western Digital Corporation or its affiliates.
*
* This file is released under the GPL.
*/
#include "dm-zoned.h"
#include <linux/module.h>
#define DM_MSG_PREFIX "zoned reclaim"
struct dmz_reclaim {
struct dmz_metadata *metadata;
struct dmz_dev *dev;
struct delayed_work work;
struct workqueue_struct *wq;
struct dm_kcopyd_client *kc;
struct dm_kcopyd_throttle kc_throttle;
int kc_err;
unsigned long flags;
/* Last target access time */
unsigned long atime;
};
/*
* Reclaim state flags.
*/
enum {
DMZ_RECLAIM_KCOPY,
};
/*
* Number of seconds of target BIO inactivity to consider the target idle.
*/
#define DMZ_IDLE_PERIOD (10UL * HZ)
/*
* Percentage of unmapped (free) random zones below which reclaim starts
* even if the target is busy.
*/
#define DMZ_RECLAIM_LOW_UNMAP_RND 30
/*
* Percentage of unmapped (free) random zones above which reclaim will
* stop if the target is busy.
*/
#define DMZ_RECLAIM_HIGH_UNMAP_RND 50
/*
* Align a sequential zone write pointer to the given block.
*/
static int dmz_reclaim_align_wp(struct dmz_reclaim *zrc, struct dm_zone *zone,
sector_t block)
{
struct dmz_metadata *zmd = zrc->metadata;
sector_t wp_block = zone->wp_block;
unsigned int nr_blocks;
int ret;
if (wp_block == block)
return 0;
if (wp_block > block)
return -EIO;
/*
* Zero out the space between the write
* pointer and the requested position.
*/
nr_blocks = block - wp_block;
ret = blkdev_issue_zeroout(zrc->dev->bdev,
dmz_start_sect(zmd, zone) + dmz_blk2sect(wp_block),
dmz_blk2sect(nr_blocks), GFP_NOFS, false);
if (ret) {
dmz_dev_err(zrc->dev,
"Align zone %u wp %llu to %llu (wp+%u) blocks failed %d",
dmz_id(zmd, zone), (unsigned long long)wp_block,
(unsigned long long)block, nr_blocks, ret);
return ret;
}
zone->wp_block = block;
return 0;
}
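
The three cases handled by dmz_reclaim_align_wp() can be sketched in plain C without any block I/O. This is an illustration only: `align_wp` and its `nr_zeroed` output parameter are hypothetical names, and the call to blkdev_issue_zeroout() is replaced by simply reporting how many blocks would be zeroed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Model of the write pointer alignment decision:
 *  - wp already at block: nothing to do
 *  - wp past block: a sequential zone cannot go backwards, fail with -EIO
 *  - wp before block: the gap [wp, block) must be zeroed, then wp moves
 * On success, *nr_zeroed holds the number of blocks that would be zeroed.
 */
static int align_wp(uint64_t *wp_block, uint64_t block, uint64_t *nr_zeroed)
{
	*nr_zeroed = 0;

	if (*wp_block == block)
		return 0;

	if (*wp_block > block)
		return -EIO;

	*nr_zeroed = block - *wp_block;
	*wp_block = block;
	return 0;
}
```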
/*
* dm_kcopyd_copy end notification.
*/
static void dmz_reclaim_kcopy_end(int read_err, unsigned long write_err,
void *context)
{
struct dmz_reclaim *zrc = context;
if (read_err || write_err)
zrc->kc_err = -EIO;
else
zrc->kc_err = 0;
clear_bit_unlock(DMZ_RECLAIM_KCOPY, &zrc->flags);
smp_mb__after_atomic();
wake_up_bit(&zrc->flags, DMZ_RECLAIM_KCOPY);
}
/*
* Copy valid blocks of src_zone into dst_zone.
*/
static int dmz_reclaim_copy(struct dmz_reclaim *zrc,
struct dm_zone *src_zone, struct dm_zone *dst_zone)
{
struct dmz_metadata *zmd = zrc->metadata;
struct dmz_dev *dev = zrc->dev;
struct dm_io_region src, dst;
sector_t block = 0, end_block;
sector_t nr_blocks;
sector_t src_zone_block;
sector_t dst_zone_block;
unsigned long flags = 0;
int ret;
if (dmz_is_seq(src_zone))
end_block = src_zone->wp_block;
else
end_block = dev->zone_nr_blocks;
src_zone_block = dmz_start_block(zmd, src_zone);
dst_zone_block = dmz_start_block(zmd, dst_zone);
if (dmz_is_seq(dst_zone))
set_bit(DM_KCOPYD_WRITE_SEQ, &flags);
while (block < end_block) {
/* Get a valid region from the source zone */
ret = dmz_first_valid_block(zmd, src_zone, &block);
if (ret <= 0)
return ret;
nr_blocks = ret;
/*
* If we are writing in a sequential zone, we must make sure
* that writes are sequential, so zero out any hole between
* writes.
*/
if (dmz_is_seq(dst_zone)) {
ret = dmz_reclaim_align_wp(zrc, dst_zone, block);
if (ret)
return ret;
}
src.bdev = dev->bdev;
src.sector = dmz_blk2sect(src_zone_block + block);
src.count = dmz_blk2sect(nr_blocks);
dst.bdev = dev->bdev;
dst.sector = dmz_blk2sect(dst_zone_block + block);
dst.count = src.count;
/* Copy the valid region */
set_bit(DMZ_RECLAIM_KCOPY, &zrc->flags);
ret = dm_kcopyd_copy(zrc->kc, &src, 1, &dst, flags,
dmz_reclaim_kcopy_end, zrc);
if (ret)
return ret;
/* Wait for copy to complete */
wait_on_bit_io(&zrc->flags, DMZ_RECLAIM_KCOPY,
TASK_UNINTERRUPTIBLE);
if (zrc->kc_err)
return zrc->kc_err;
block += nr_blocks;
if (dmz_is_seq(dst_zone))
dst_zone->wp_block = block;
}
return 0;
}
/*
* Move valid blocks of dzone buffer zone into dzone (after its write pointer)
* and free the buffer zone.
*/
static int dmz_reclaim_buf(struct dmz_reclaim *zrc, struct dm_zone *dzone)
{
struct dm_zone *bzone = dzone->bzone;
sector_t chunk_block = dzone->wp_block;
struct dmz_metadata *zmd = zrc->metadata;
int ret;
dmz_dev_debug(zrc->dev,
"Chunk %u, move buf zone %u (weight %u) to data zone %u (weight %u)",
dzone->chunk, dmz_id(zmd, bzone), dmz_weight(bzone),
dmz_id(zmd, dzone), dmz_weight(dzone));
/* Flush the buffer zone into the data zone */
ret = dmz_reclaim_copy(zrc, bzone, dzone);
if (ret < 0)
return ret;
dmz_lock_flush(zmd);
/* Validate copied blocks */
ret = dmz_merge_valid_blocks(zmd, bzone, dzone, chunk_block);
if (ret == 0) {
/* Free the buffer zone */
dmz_invalidate_blocks(zmd, bzone, 0, zrc->dev->zone_nr_blocks);
dmz_lock_map(zmd);
dmz_unmap_zone(zmd, bzone);
dmz_unlock_zone_reclaim(dzone);
dmz_free_zone(zmd, bzone);
dmz_unlock_map(zmd);
}
dmz_unlock_flush(zmd);
return 0;
}
/*
* Merge valid blocks of dzone into its buffer zone and free dzone.
*/
static int dmz_reclaim_seq_data(struct dmz_reclaim *zrc, struct dm_zone *dzone)
{
unsigned int chunk = dzone->chunk;
struct dm_zone *bzone = dzone->bzone;
struct dmz_metadata *zmd = zrc->metadata;
int ret = 0;
dmz_dev_debug(zrc->dev,
"Chunk %u, move data zone %u (weight %u) to buf zone %u (weight %u)",
chunk, dmz_id(zmd, dzone), dmz_weight(dzone),
dmz_id(zmd, bzone), dmz_weight(bzone));
/* Flush data zone into the buffer zone */
ret = dmz_reclaim_copy(zrc, dzone, bzone);
if (ret < 0)
return ret;
dmz_lock_flush(zmd);
/* Validate copied blocks */
ret = dmz_merge_valid_blocks(zmd, dzone, bzone, 0);
if (ret == 0) {
/*
* Free the data zone and remap the chunk to
* the buffer zone.
*/
dmz_invalidate_blocks(zmd, dzone, 0, zrc->dev->zone_nr_blocks);
dmz_lock_map(zmd);
dmz_unmap_zone(zmd, bzone);
dmz_unmap_zone(zmd, dzone);
dmz_unlock_zone_reclaim(dzone);
dmz_free_zone(zmd, dzone);
dmz_map_zone(zmd, bzone, chunk);
dmz_unlock_map(zmd);
}
dmz_unlock_flush(zmd);
return 0;
}
/*
* Move valid blocks of the random data zone dzone into a free sequential zone.
* Once blocks are moved, remap the zone chunk to the sequential zone.
*/
static int dmz_reclaim_rnd_data(struct dmz_reclaim *zrc, struct dm_zone *dzone)
{
unsigned int chunk = dzone->chunk;
struct dm_zone *szone = NULL;
struct dmz_metadata *zmd = zrc->metadata;
int ret;
/* Get a free sequential zone */
dmz_lock_map(zmd);
szone = dmz_alloc_zone(zmd, DMZ_ALLOC_RECLAIM);
dmz_unlock_map(zmd);
if (!szone)
return -ENOSPC;
dmz_dev_debug(zrc->dev,
"Chunk %u, move rnd zone %u (weight %u) to seq zone %u",
chunk, dmz_id(zmd, dzone), dmz_weight(dzone),
dmz_id(zmd, szone));
/* Flush the random data zone into the sequential zone */
ret = dmz_reclaim_copy(zrc, dzone, szone);
dmz_lock_flush(zmd);
if (ret == 0) {
/* Validate copied blocks */
ret = dmz_copy_valid_blocks(zmd, dzone, szone);
}
if (ret) {
/* Free the sequential zone */
dmz_lock_map(zmd);
dmz_free_zone(zmd, szone);
dmz_unlock_map(zmd);
} else {
/* Free the data zone and remap the chunk */
dmz_invalidate_blocks(zmd, dzone, 0, zrc->dev->zone_nr_blocks);
dmz_lock_map(zmd);
dmz_unmap_zone(zmd, dzone);
dmz_unlock_zone_reclaim(dzone);
dmz_free_zone(zmd, dzone);
dmz_map_zone(zmd, szone, chunk);
dmz_unlock_map(zmd);
}
dmz_unlock_flush(zmd);
return 0;
}
/*
* Reclaim an empty zone.
*/
static void dmz_reclaim_empty(struct dmz_reclaim *zrc, struct dm_zone *dzone)
{
struct dmz_metadata *zmd = zrc->metadata;
dmz_lock_flush(zmd);
dmz_lock_map(zmd);
dmz_unmap_zone(zmd, dzone);
dmz_unlock_zone_reclaim(dzone);
dmz_free_zone(zmd, dzone);
dmz_unlock_map(zmd);
dmz_unlock_flush(zmd);
}
/*
* Find a candidate zone for reclaim and process it.
*/
static void dmz_reclaim(struct dmz_reclaim *zrc)
{
struct dmz_metadata *zmd = zrc->metadata;
struct dm_zone *dzone;
struct dm_zone *rzone;
unsigned long start;
int ret;
/* Get a data zone */
dzone = dmz_get_zone_for_reclaim(zmd);
if (!dzone)
return;
start = jiffies;
if (dmz_is_rnd(dzone)) {
if (!dmz_weight(dzone)) {
/* Empty zone */
dmz_reclaim_empty(zrc, dzone);
ret = 0;
} else {
/*
* Reclaim the random data zone by moving its
* valid data blocks to a free sequential zone.
*/
ret = dmz_reclaim_rnd_data(zrc, dzone);
}
rzone = dzone;
} else {
struct dm_zone *bzone = dzone->bzone;
sector_t chunk_block = 0;
ret = dmz_first_valid_block(zmd, bzone, &chunk_block);
if (ret < 0)
goto out;
if (ret == 0 || chunk_block >= dzone->wp_block) {
/*
* The buffer zone is empty or its valid blocks are
* after the data zone write pointer.
*/
ret = dmz_reclaim_buf(zrc, dzone);
rzone = bzone;
} else {
/*
* Reclaim the data zone by merging it into the
* buffer zone so that the buffer zone itself can
* be later reclaimed.
*/
ret = dmz_reclaim_seq_data(zrc, dzone);
rzone = dzone;
}
}
out:
if (ret) {
dmz_unlock_zone_reclaim(dzone);
return;
}
(void) dmz_flush_metadata(zrc->metadata);
dmz_dev_debug(zrc->dev, "Reclaimed zone %u in %u ms",
dmz_id(zmd, rzone), jiffies_to_msecs(jiffies - start));
}
/*
* Test if the target device is idle.
*/
static inline int dmz_target_idle(struct dmz_reclaim *zrc)
{
return time_is_before_jiffies(zrc->atime + DMZ_IDLE_PERIOD);
}
/*
* Test if reclaim is necessary.
*/
static bool dmz_should_reclaim(struct dmz_reclaim *zrc)
{
struct dmz_metadata *zmd = zrc->metadata;
unsigned int nr_rnd = dmz_nr_rnd_zones(zmd);
unsigned int nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd);
unsigned int p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd;
/* Reclaim when idle */
if (dmz_target_idle(zrc) && nr_unmap_rnd < nr_rnd)
return true;
/* If there are still plenty of free random zones, do not reclaim */
if (p_unmap_rnd >= DMZ_RECLAIM_HIGH_UNMAP_RND)
return false;
/*
* If the percentage of unmapped random zones is low,
* reclaim even if the target is busy.
*/
return p_unmap_rnd <= DMZ_RECLAIM_LOW_UNMAP_RND;
}
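
The decision made by dmz_should_reclaim() can be exercised in userspace with a small model. This is a sketch under stated assumptions: `should_reclaim` is a hypothetical name, the idle state is passed in as a flag instead of being derived from jiffies, and `nr_rnd` is assumed to be non-zero.

```c
#include <assert.h>
#include <stdbool.h>

#define LOW_UNMAP_RND	30	/* mirrors DMZ_RECLAIM_LOW_UNMAP_RND */
#define HIGH_UNMAP_RND	50	/* mirrors DMZ_RECLAIM_HIGH_UNMAP_RND */

/* Model of the reclaim decision; idle replaces dmz_target_idle(). */
static bool should_reclaim(bool idle, unsigned int nr_unmap_rnd,
			   unsigned int nr_rnd)
{
	unsigned int p_unmap_rnd;

	assert(nr_rnd > 0);	/* assumption: at least one random zone */
	p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd;

	/* Idle target with at least one mapped random zone: reclaim */
	if (idle && nr_unmap_rnd < nr_rnd)
		return true;

	/* Plenty of free random zones: do not reclaim */
	if (p_unmap_rnd >= HIGH_UNMAP_RND)
		return false;

	/* Busy target: reclaim only at or below the low threshold */
	return p_unmap_rnd <= LOW_UNMAP_RND;
}
```

As written, a busy target with between 31% and 49% free random zones also returns false, so for a busy target the outcome is determined by the low threshold alone.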
/*
* Reclaim work function.
*/
static void dmz_reclaim_work(struct work_struct *work)
{
struct dmz_reclaim *zrc = container_of(work, struct dmz_reclaim, work.work);
struct dmz_metadata *zmd = zrc->metadata;
unsigned int nr_rnd, nr_unmap_rnd;
unsigned int p_unmap_rnd;
if (!dmz_should_reclaim(zrc)) {
mod_delayed_work(zrc->wq, &zrc->work, DMZ_IDLE_PERIOD);
return;
}
/*
* We need to start reclaiming random zones: set up zone copy
* throttling to go fast if we are idle or very low on free random
* zones, and slower if some free random zones remain, to limit
* the impact on the user workload as much as possible.
*/
nr_rnd = dmz_nr_rnd_zones(zmd);
nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd);
p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd;
if (dmz_target_idle(zrc) || p_unmap_rnd < DMZ_RECLAIM_LOW_UNMAP_RND / 2) {
/* Idle or very low percentage: go fast */
zrc->kc_throttle.throttle = 100;
} else {
/* Busy but we still have some free random zones: throttle */
zrc->kc_throttle.throttle = min(75U, 100U - p_unmap_rnd / 2);
}
dmz_dev_debug(zrc->dev,
"Reclaim (%u): %s, %u%% free rnd zones (%u/%u)",
zrc->kc_throttle.throttle,
(dmz_target_idle(zrc) ? "Idle" : "Busy"),
p_unmap_rnd, nr_unmap_rnd, nr_rnd);
dmz_reclaim(zrc);
dmz_schedule_reclaim(zrc);
}
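
The throttle selection in dmz_reclaim_work() reduces to a small pure function, sketched below for illustration. `reclaim_throttle` is a hypothetical name, the idle state is passed in as a flag, and `p_unmap_rnd` is assumed to be a percentage in the range 0 to 100. The sketch assumes that a kcopyd throttle value of 100 means copies run unthrottled.

```c
#include <assert.h>
#include <stdbool.h>

#define LOW_UNMAP_RND	30	/* mirrors DMZ_RECLAIM_LOW_UNMAP_RND */

/* Model of the kcopyd throttle percentage chosen for one reclaim pass. */
static unsigned int reclaim_throttle(bool idle, unsigned int p_unmap_rnd)
{
	unsigned int throttle;

	assert(p_unmap_rnd <= 100);	/* assumption: a valid percentage */

	/* Idle or very low percentage of free random zones: go fast */
	if (idle || p_unmap_rnd < LOW_UNMAP_RND / 2)
		return 100;

	/* Busy with some free random zones left: cap the copy rate at 75 */
	throttle = 100U - p_unmap_rnd / 2;
	return throttle < 75U ? throttle : 75U;
}
```

With these constants, any busy target with 15% to 50% free random zones gets the 75 cap, since `100 - p_unmap_rnd / 2` only drops below 75 once more than half of the random zones are free.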
/*
* Initialize reclaim.
*/
int dmz_ctr_reclaim(struct dmz_dev *dev, struct dmz_metadata *zmd,
struct dmz_reclaim **reclaim)
{
struct dmz_reclaim *zrc;
int ret;
zrc = kzalloc(sizeof(struct dmz_reclaim), GFP_KERNEL);
if (!zrc)
return -ENOMEM;
zrc->dev = dev;
zrc->metadata = zmd;
zrc->atime = jiffies;
/* Reclaim kcopyd client */
zrc->kc = dm_kcopyd_client_create(&zrc->kc_throttle);
if (IS_ERR(zrc->kc)) {
ret = PTR_ERR(zrc->kc);
zrc->kc = NULL;
goto err;
}
/* Reclaim work */
INIT_DELAYED_WORK(&zrc->work, dmz_reclaim_work);
zrc->wq = alloc_ordered_workqueue("dmz_rwq_%s", WQ_MEM_RECLAIM,
dev->name);
if (!zrc->wq) {
ret = -ENOMEM;
goto err;
}
*reclaim = zrc;
queue_delayed_work(zrc->wq, &zrc->work, 0);
return 0;
err:
if (zrc->kc)
dm_kcopyd_client_destroy(zrc->kc);
kfree(zrc);
return ret;
}
/*
* Terminate reclaim.
*/
void dmz_dtr_reclaim(struct dmz_reclaim *zrc)
{
cancel_delayed_work_sync(&zrc->work);
destroy_workqueue(zrc->wq);
dm_kcopyd_client_destroy(zrc->kc);
kfree(zrc);
}
/*
* Suspend reclaim.
*/
void dmz_suspend_reclaim(struct dmz_reclaim *zrc)
{
cancel_delayed_work_sync(&zrc->work);
}
/*
* Resume reclaim.
*/
void dmz_resume_reclaim(struct dmz_reclaim *zrc)
{
queue_delayed_work(zrc->wq, &zrc->work, DMZ_IDLE_PERIOD);
}
/*
* BIO accounting.
*/
void dmz_reclaim_bio_acc(struct dmz_reclaim *zrc)
{
zrc->atime = jiffies;
}
/*
* Start reclaim if necessary.
*/
void dmz_schedule_reclaim(struct dmz_reclaim *zrc)
{
if (dmz_should_reclaim(zrc))
mod_delayed_work(zrc->wq, &zrc->work, 0);
}
......@@ -20,6 +20,7 @@
#define DM_KCOPYD_MAX_REGIONS 8
#define DM_KCOPYD_IGNORE_ERROR 1
#define DM_KCOPYD_WRITE_SEQ 2
struct dm_kcopyd_throttle {
unsigned throttle;
......