提交 d4c90b1b 编写于 作者: L Linus Torvalds

Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block

Pull block IO driver bits from Jens Axboe:
 "As I mentioned in the core block pull request, due to real life
  circumstances the driver pull request would be late.  Now it looks
  like -rc2 late...  On the plus side, apart form the rsxx update, these
  are all things that I could argue could go in later in the cycle as
  they are fixes and not features.  So even though things are late, it's
  not ALL bad.

  The pull request contains:

   - Updates to bcache, all bug fixes, from Kent.

   - A pile of drbd bug fixes (no big features this time!).

   - xen blk front/back fixes.

   - rsxx driver updates, some of them deferred form 3.10.  So should be
     well cooked by now"

* 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
  bcache: Allocation kthread fixes
  bcache: Fix GC_SECTORS_USED() calculation
  bcache: Journal replay fix
  bcache: Shutdown fix
  bcache: Fix a sysfs splat on shutdown
  bcache: Advertise that flushes are supported
  bcache: check for allocation failures
  bcache: Fix a dumb race
  bcache: Use standard utility code
  bcache: Update email address
  bcache: Delete fuzz tester
  bcache: Document shrinker reserve better
  bcache: FUA fixes
  drbd: Allow online change of al-stripes and al-stripe-size
  drbd: Constants should be UPPERCASE
  drbd: Ignore the exit code of a fence-peer handler if it returns too late
  drbd: Fix rcu_read_lock balance on error path
  drbd: fix error return code in drbd_init()
  drbd: Do not sleep inside rcu
  bcache: Refresh usage docs
  ...
What: /sys/module/xen_blkback/parameters/max_buffer_pages
Date: March 2013
KernelVersion: 3.11
Contact: Roger Pau Monné <roger.pau@citrix.com>
Description:
Maximum number of free pages to keep in each block
backend buffer.
What: /sys/module/xen_blkback/parameters/max_persistent_grants
Date: March 2013
KernelVersion: 3.11
Contact: Roger Pau Monné <roger.pau@citrix.com>
Description:
Maximum number of grants to map persistently in
blkback. If the frontend tries to use more than
max_persistent_grants, the LRU kicks in and starts
removing 5% of max_persistent_grants every 100ms.
What: /sys/module/xen_blkfront/parameters/max
Date: June 2013
KernelVersion: 3.11
Contact: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Description:
Maximum number of segments that the frontend will negotiate
with the backend for indirect descriptors. The default value
is 32 - higher value means more potential throughput but more
memory usage. The backend picks the minimum of the frontend
and its default backend value.
......@@ -46,29 +46,33 @@ you format your backing devices and cache device at the same time, you won't
have to manually attach:
make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
To make bcache devices known to the kernel, echo them to /sys/fs/bcache/register:
bcache-tools now ships udev rules, and bcache devices are known to the kernel
immediately. Without udev, you can manually register devices like this:
echo /dev/sdb > /sys/fs/bcache/register
echo /dev/sdc > /sys/fs/bcache/register
To register your bcache devices automatically, you could add something like
this to an init script:
Registering the backing device makes the bcache device show up in /dev; you can
now format it and use it as normal. But the first time using a new bcache
device, it'll be running in passthrough mode until you attach it to a cache.
See the section on attaching.
echo /dev/sd* > /sys/fs/bcache/register_quiet
The devices show up as:
It'll look for bcache superblocks and ignore everything that doesn't have one.
/dev/bcache<N>
Registering the backing device makes the bcache show up in /dev; you can now
format it and use it as normal. But the first time using a new bcache device,
it'll be running in passthrough mode until you attach it to a cache. See the
section on attaching.
As well as (with udev):
The devices show up at /dev/bcacheN, and can be controlled via sysfs from
/sys/block/bcacheN/bcache:
/dev/bcache/by-uuid/<uuid>
/dev/bcache/by-label/<label>
To get started:
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /mnt
You can control bcache devices through sysfs at /sys/block/bcache<N>/bcache .
Cache devices are managed as sets; multiple caches per set isn't supported yet
but will allow for mirroring of metadata and dirty data in the future. Your new
cache set shows up as /sys/fs/bcache/<UUID>
......@@ -80,11 +84,11 @@ must be attached to your cache set to enable caching. Attaching a backing
device to a cache set is done thusly, with the UUID of the cache set in
/sys/fs/bcache:
echo <UUID> > /sys/block/bcache0/bcache/attach
echo <CSET-UUID> > /sys/block/bcache0/bcache/attach
This only has to be done once. The next time you reboot, just reregister all
your bcache devices. If a backing device has data in a cache somewhere, the
/dev/bcache# device won't be created until the cache shows up - particularly
/dev/bcache<N> device won't be created until the cache shows up - particularly
important if you have writeback caching turned on.
If you're booting up and your cache device is gone and never coming back, you
......@@ -191,6 +195,9 @@ want for getting the best possible numbers when benchmarking.
SYSFS - BACKING DEVICE:
Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and
(if attached) /sys/fs/bcache/<cset-uuid>/bdev*
attach
Echo the UUID of a cache set to this file to enable caching.
......@@ -300,6 +307,8 @@ cache_readaheads
SYSFS - CACHE SET:
Available at /sys/fs/bcache/<cset-uuid>
average_key_size
Average data per key in the btree.
......@@ -390,6 +399,8 @@ trigger_gc
SYSFS - CACHE DEVICE:
Available at /sys/block/<cdev>/bcache
block_size
Minimum granularity of writes - should match hardware sector size.
......
......@@ -1642,7 +1642,7 @@ S: Maintained
F: drivers/net/hamradio/baycom*
BCACHE (BLOCK LAYER CACHE)
M: Kent Overstreet <koverstreet@google.com>
M: Kent Overstreet <kmo@daterainc.com>
L: linux-bcache@vger.kernel.org
W: http://bcache.evilpiepirate.org
S: Maintained:
......@@ -3346,7 +3346,7 @@ F: Documentation/firmware_class/
F: drivers/base/firmware*.c
F: include/linux/firmware.h
FLASHSYSTEM DRIVER (IBM FlashSystem 70/80 PCI SSD Flash Card)
FLASH ADAPTER DRIVER (IBM Flash Adapter 900GB Full Height PCI Flash Card)
M: Joshua Morris <josh.h.morris@us.ibm.com>
M: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
S: Maintained
......
......@@ -532,11 +532,11 @@ config BLK_DEV_RBD
If unsure, say N.
config BLK_DEV_RSXX
tristate "IBM FlashSystem 70/80 PCIe SSD Device Driver"
tristate "IBM Flash Adapter 900GB Full Height PCIe Device Driver"
depends on PCI
help
Device driver for IBM's high speed PCIe SSD
storage devices: FlashSystem-70 and FlashSystem-80.
storage device: Flash Adapter 900GB Full Height.
To compile this driver as a module, choose M here: the
module will be called rsxx.
......
......@@ -659,6 +659,27 @@ void drbd_al_shrink(struct drbd_conf *mdev)
wake_up(&mdev->al_wait);
}
int drbd_initialize_al(struct drbd_conf *mdev, void *buffer)
{
struct al_transaction_on_disk *al = buffer;
struct drbd_md *md = &mdev->ldev->md;
sector_t al_base = md->md_offset + md->al_offset;
int al_size_4k = md->al_stripes * md->al_stripe_size_4k;
int i;
memset(al, 0, 4096);
al->magic = cpu_to_be32(DRBD_AL_MAGIC);
al->transaction_type = cpu_to_be16(AL_TR_INITIALIZED);
al->crc32c = cpu_to_be32(crc32c(0, al, 4096));
for (i = 0; i < al_size_4k; i++) {
int err = drbd_md_sync_page_io(mdev, mdev->ldev, al_base + i * 8, WRITE);
if (err)
return err;
}
return 0;
}
static int w_update_odbm(struct drbd_work *w, int unused)
{
struct update_odbm_work *udw = container_of(w, struct update_odbm_work, w);
......
......@@ -832,6 +832,7 @@ struct drbd_tconn { /* is a resource from the config file */
unsigned susp_nod:1; /* IO suspended because no data */
unsigned susp_fen:1; /* IO suspended because fence peer handler runs */
struct mutex cstate_mutex; /* Protects graceful disconnects */
unsigned int connect_cnt; /* Inc each time a connection is established */
unsigned long flags;
struct net_conf *net_conf; /* content protected by rcu */
......@@ -1132,6 +1133,7 @@ extern void drbd_mdev_cleanup(struct drbd_conf *mdev);
void drbd_print_uuids(struct drbd_conf *mdev, const char *text);
extern void conn_md_sync(struct drbd_tconn *tconn);
extern void drbd_md_write(struct drbd_conf *mdev, void *buffer);
extern void drbd_md_sync(struct drbd_conf *mdev);
extern int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev);
extern void drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local);
......@@ -1466,8 +1468,16 @@ extern void drbd_suspend_io(struct drbd_conf *mdev);
extern void drbd_resume_io(struct drbd_conf *mdev);
extern char *ppsize(char *buf, unsigned long long size);
extern sector_t drbd_new_dev_size(struct drbd_conf *, struct drbd_backing_dev *, sector_t, int);
enum determine_dev_size { dev_size_error = -1, unchanged = 0, shrunk = 1, grew = 2 };
extern enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *, enum dds_flags) __must_hold(local);
enum determine_dev_size {
DS_ERROR_SHRINK = -3,
DS_ERROR_SPACE_MD = -2,
DS_ERROR = -1,
DS_UNCHANGED = 0,
DS_SHRUNK = 1,
DS_GREW = 2
};
extern enum determine_dev_size
drbd_determine_dev_size(struct drbd_conf *, enum dds_flags, struct resize_parms *) __must_hold(local);
extern void resync_after_online_grow(struct drbd_conf *);
extern void drbd_reconsider_max_bio_size(struct drbd_conf *mdev);
extern enum drbd_state_rv drbd_set_role(struct drbd_conf *mdev,
......@@ -1633,6 +1643,7 @@ extern int __drbd_set_out_of_sync(struct drbd_conf *mdev, sector_t sector,
#define drbd_set_out_of_sync(mdev, sector, size) \
__drbd_set_out_of_sync(mdev, sector, size, __FILE__, __LINE__)
extern void drbd_al_shrink(struct drbd_conf *mdev);
extern int drbd_initialize_al(struct drbd_conf *, void *);
/* drbd_nl.c */
/* state info broadcast */
......
......@@ -2762,8 +2762,6 @@ int __init drbd_init(void)
/*
* allocate all necessary structs
*/
err = -ENOMEM;
init_waitqueue_head(&drbd_pp_wait);
drbd_proc = NULL; /* play safe for drbd_cleanup */
......@@ -2773,6 +2771,7 @@ int __init drbd_init(void)
if (err)
goto fail;
err = -ENOMEM;
drbd_proc = proc_create_data("drbd", S_IFREG | S_IRUGO , NULL, &drbd_proc_fops, NULL);
if (!drbd_proc) {
printk(KERN_ERR "drbd: unable to register proc file\n");
......@@ -2803,7 +2802,6 @@ int __init drbd_init(void)
fail:
drbd_cleanup();
if (err == -ENOMEM)
/* currently always the case */
printk(KERN_ERR "drbd: ran out of memory\n");
else
printk(KERN_ERR "drbd: initialization failure\n");
......@@ -2881,34 +2879,14 @@ struct meta_data_on_disk {
u8 reserved_u8[4096 - (7*8 + 10*4)];
} __packed;
/**
* drbd_md_sync() - Writes the meta data super block if the MD_DIRTY flag bit is set
* @mdev: DRBD device.
*/
void drbd_md_sync(struct drbd_conf *mdev)
void drbd_md_write(struct drbd_conf *mdev, void *b)
{
struct meta_data_on_disk *buffer;
struct meta_data_on_disk *buffer = b;
sector_t sector;
int i;
/* Don't accidentally change the DRBD meta data layout. */
BUILD_BUG_ON(UI_SIZE != 4);
BUILD_BUG_ON(sizeof(struct meta_data_on_disk) != 4096);
del_timer(&mdev->md_sync_timer);
/* timer may be rearmed by drbd_md_mark_dirty() now. */
if (!test_and_clear_bit(MD_DIRTY, &mdev->flags))
return;
/* We use here D_FAILED and not D_ATTACHING because we try to write
* metadata even if we detach due to a disk failure! */
if (!get_ldev_if_state(mdev, D_FAILED))
return;
buffer = drbd_md_get_buffer(mdev);
if (!buffer)
goto out;
memset(buffer, 0, sizeof(*buffer));
buffer->la_size_sect = cpu_to_be64(drbd_get_capacity(mdev->this_bdev));
......@@ -2937,6 +2915,35 @@ void drbd_md_sync(struct drbd_conf *mdev)
dev_err(DEV, "meta data update failed!\n");
drbd_chk_io_error(mdev, 1, DRBD_META_IO_ERROR);
}
}
/**
* drbd_md_sync() - Writes the meta data super block if the MD_DIRTY flag bit is set
* @mdev: DRBD device.
*/
void drbd_md_sync(struct drbd_conf *mdev)
{
struct meta_data_on_disk *buffer;
/* Don't accidentally change the DRBD meta data layout. */
BUILD_BUG_ON(UI_SIZE != 4);
BUILD_BUG_ON(sizeof(struct meta_data_on_disk) != 4096);
del_timer(&mdev->md_sync_timer);
/* timer may be rearmed by drbd_md_mark_dirty() now. */
if (!test_and_clear_bit(MD_DIRTY, &mdev->flags))
return;
/* We use here D_FAILED and not D_ATTACHING because we try to write
* metadata even if we detach due to a disk failure! */
if (!get_ldev_if_state(mdev, D_FAILED))
return;
buffer = drbd_md_get_buffer(mdev);
if (!buffer)
goto out;
drbd_md_write(mdev, buffer);
/* Update mdev->ldev->md.la_size_sect,
* since we updated it on metadata. */
......
......@@ -417,6 +417,7 @@ static enum drbd_fencing_p highest_fencing_policy(struct drbd_tconn *tconn)
bool conn_try_outdate_peer(struct drbd_tconn *tconn)
{
unsigned int connect_cnt;
union drbd_state mask = { };
union drbd_state val = { };
enum drbd_fencing_p fp;
......@@ -428,6 +429,10 @@ bool conn_try_outdate_peer(struct drbd_tconn *tconn)
return false;
}
spin_lock_irq(&tconn->req_lock);
connect_cnt = tconn->connect_cnt;
spin_unlock_irq(&tconn->req_lock);
fp = highest_fencing_policy(tconn);
switch (fp) {
case FP_NOT_AVAIL:
......@@ -492,8 +497,14 @@ bool conn_try_outdate_peer(struct drbd_tconn *tconn)
here, because we might were able to re-establish the connection in the
meantime. */
spin_lock_irq(&tconn->req_lock);
if (tconn->cstate < C_WF_REPORT_PARAMS && !test_bit(STATE_SENT, &tconn->flags))
_conn_request_state(tconn, mask, val, CS_VERBOSE);
if (tconn->cstate < C_WF_REPORT_PARAMS && !test_bit(STATE_SENT, &tconn->flags)) {
if (tconn->connect_cnt != connect_cnt)
/* In case the connection was established and droped
while the fence-peer handler was running, ignore it */
conn_info(tconn, "Ignoring fence-peer exit code\n");
else
_conn_request_state(tconn, mask, val, CS_VERBOSE);
}
spin_unlock_irq(&tconn->req_lock);
return conn_highest_pdsk(tconn) <= D_OUTDATED;
......@@ -816,15 +827,20 @@ void drbd_resume_io(struct drbd_conf *mdev)
* Returns 0 on success, negative return values indicate errors.
* You should call drbd_md_sync() after calling this function.
*/
enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds_flags flags) __must_hold(local)
enum determine_dev_size
drbd_determine_dev_size(struct drbd_conf *mdev, enum dds_flags flags, struct resize_parms *rs) __must_hold(local)
{
sector_t prev_first_sect, prev_size; /* previous meta location */
sector_t la_size_sect, u_size;
struct drbd_md *md = &mdev->ldev->md;
u32 prev_al_stripe_size_4k;
u32 prev_al_stripes;
sector_t size;
char ppb[10];
void *buffer;
int md_moved, la_size_changed;
enum determine_dev_size rv = unchanged;
enum determine_dev_size rv = DS_UNCHANGED;
/* race:
* application request passes inc_ap_bio,
......@@ -836,6 +852,11 @@ enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds
* still lock the act_log to not trigger ASSERTs there.
*/
drbd_suspend_io(mdev);
buffer = drbd_md_get_buffer(mdev); /* Lock meta-data IO */
if (!buffer) {
drbd_resume_io(mdev);
return DS_ERROR;
}
/* no wait necessary anymore, actually we could assert that */
wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
......@@ -844,7 +865,17 @@ enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds
prev_size = mdev->ldev->md.md_size_sect;
la_size_sect = mdev->ldev->md.la_size_sect;
/* TODO: should only be some assert here, not (re)init... */
if (rs) {
/* rs is non NULL if we should change the AL layout only */
prev_al_stripes = md->al_stripes;
prev_al_stripe_size_4k = md->al_stripe_size_4k;
md->al_stripes = rs->al_stripes;
md->al_stripe_size_4k = rs->al_stripe_size / 4;
md->al_size_4k = (u64)rs->al_stripes * rs->al_stripe_size / 4;
}
drbd_md_set_sector_offsets(mdev, mdev->ldev);
rcu_read_lock();
......@@ -852,6 +883,21 @@ enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds
rcu_read_unlock();
size = drbd_new_dev_size(mdev, mdev->ldev, u_size, flags & DDSF_FORCED);
if (size < la_size_sect) {
if (rs && u_size == 0) {
/* Remove "rs &&" later. This check should always be active, but
right now the receiver expects the permissive behavior */
dev_warn(DEV, "Implicit shrink not allowed. "
"Use --size=%llus for explicit shrink.\n",
(unsigned long long)size);
rv = DS_ERROR_SHRINK;
}
if (u_size > size)
rv = DS_ERROR_SPACE_MD;
if (rv != DS_UNCHANGED)
goto err_out;
}
if (drbd_get_capacity(mdev->this_bdev) != size ||
drbd_bm_capacity(mdev) != size) {
int err;
......@@ -867,7 +913,7 @@ enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds
"Leaving size unchanged at size = %lu KB\n",
(unsigned long)size);
}
rv = dev_size_error;
rv = DS_ERROR;
}
/* racy, see comments above. */
drbd_set_my_capacity(mdev, size);
......@@ -875,38 +921,57 @@ enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *mdev, enum dds
dev_info(DEV, "size = %s (%llu KB)\n", ppsize(ppb, size>>1),
(unsigned long long)size>>1);
}
if (rv == dev_size_error)
goto out;
if (rv <= DS_ERROR)
goto err_out;
la_size_changed = (la_size_sect != mdev->ldev->md.la_size_sect);
md_moved = prev_first_sect != drbd_md_first_sector(mdev->ldev)
|| prev_size != mdev->ldev->md.md_size_sect;
if (la_size_changed || md_moved) {
int err;
if (la_size_changed || md_moved || rs) {
u32 prev_flags;
drbd_al_shrink(mdev); /* All extents inactive. */
prev_flags = md->flags;
md->flags &= ~MDF_PRIMARY_IND;
drbd_md_write(mdev, buffer);
dev_info(DEV, "Writing the whole bitmap, %s\n",
la_size_changed && md_moved ? "size changed and md moved" :
la_size_changed ? "size changed" : "md moved");
/* next line implicitly does drbd_suspend_io()+drbd_resume_io() */
err = drbd_bitmap_io(mdev, md_moved ? &drbd_bm_write_all : &drbd_bm_write,
"size changed", BM_LOCKED_MASK);
if (err) {
rv = dev_size_error;
goto out;
}
drbd_md_mark_dirty(mdev);
drbd_bitmap_io(mdev, md_moved ? &drbd_bm_write_all : &drbd_bm_write,
"size changed", BM_LOCKED_MASK);
drbd_initialize_al(mdev, buffer);
md->flags = prev_flags;
drbd_md_write(mdev, buffer);
if (rs)
dev_info(DEV, "Changed AL layout to al-stripes = %d, al-stripe-size-kB = %d\n",
md->al_stripes, md->al_stripe_size_4k * 4);
}
if (size > la_size_sect)
rv = grew;
rv = DS_GREW;
if (size < la_size_sect)
rv = shrunk;
out:
rv = DS_SHRUNK;
if (0) {
err_out:
if (rs) {
md->al_stripes = prev_al_stripes;
md->al_stripe_size_4k = prev_al_stripe_size_4k;
md->al_size_4k = (u64)prev_al_stripes * prev_al_stripe_size_4k;
drbd_md_set_sector_offsets(mdev, mdev->ldev);
}
}
lc_unlock(mdev->act_log);
wake_up(&mdev->al_wait);
drbd_md_put_buffer(mdev);
drbd_resume_io(mdev);
return rv;
......@@ -1607,11 +1672,11 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
!drbd_md_test_flag(mdev->ldev, MDF_CONNECTED_IND))
set_bit(USE_DEGR_WFC_T, &mdev->flags);
dd = drbd_determine_dev_size(mdev, 0);
if (dd == dev_size_error) {
dd = drbd_determine_dev_size(mdev, 0, NULL);
if (dd <= DS_ERROR) {
retcode = ERR_NOMEM_BITMAP;
goto force_diskless_dec;
} else if (dd == grew)
} else if (dd == DS_GREW)
set_bit(RESYNC_AFTER_NEG, &mdev->flags);
if (drbd_md_test_flag(mdev->ldev, MDF_FULL_SYNC) ||
......@@ -2305,6 +2370,7 @@ int drbd_adm_resize(struct sk_buff *skb, struct genl_info *info)
struct drbd_conf *mdev;
enum drbd_ret_code retcode;
enum determine_dev_size dd;
bool change_al_layout = false;
enum dds_flags ddsf;
sector_t u_size;
int err;
......@@ -2315,31 +2381,33 @@ int drbd_adm_resize(struct sk_buff *skb, struct genl_info *info)
if (retcode != NO_ERROR)
goto fail;
mdev = adm_ctx.mdev;
if (!get_ldev(mdev)) {
retcode = ERR_NO_DISK;
goto fail;
}
memset(&rs, 0, sizeof(struct resize_parms));
rs.al_stripes = mdev->ldev->md.al_stripes;
rs.al_stripe_size = mdev->ldev->md.al_stripe_size_4k * 4;
if (info->attrs[DRBD_NLA_RESIZE_PARMS]) {
err = resize_parms_from_attrs(&rs, info);
if (err) {
retcode = ERR_MANDATORY_TAG;
drbd_msg_put_info(from_attrs_err_to_txt(err));
goto fail;
goto fail_ldev;
}
}
mdev = adm_ctx.mdev;
if (mdev->state.conn > C_CONNECTED) {
retcode = ERR_RESIZE_RESYNC;
goto fail;
goto fail_ldev;
}
if (mdev->state.role == R_SECONDARY &&
mdev->state.peer == R_SECONDARY) {
retcode = ERR_NO_PRIMARY;
goto fail;
}
if (!get_ldev(mdev)) {
retcode = ERR_NO_DISK;
goto fail;
goto fail_ldev;
}
if (rs.no_resync && mdev->tconn->agreed_pro_version < 93) {
......@@ -2358,6 +2426,28 @@ int drbd_adm_resize(struct sk_buff *skb, struct genl_info *info)
}
}
if (mdev->ldev->md.al_stripes != rs.al_stripes ||
mdev->ldev->md.al_stripe_size_4k != rs.al_stripe_size / 4) {
u32 al_size_k = rs.al_stripes * rs.al_stripe_size;
if (al_size_k > (16 * 1024 * 1024)) {
retcode = ERR_MD_LAYOUT_TOO_BIG;
goto fail_ldev;
}
if (al_size_k < MD_32kB_SECT/2) {
retcode = ERR_MD_LAYOUT_TOO_SMALL;
goto fail_ldev;
}
if (mdev->state.conn != C_CONNECTED) {
retcode = ERR_MD_LAYOUT_CONNECTED;
goto fail_ldev;
}
change_al_layout = true;
}
if (mdev->ldev->known_size != drbd_get_capacity(mdev->ldev->backing_bdev))
mdev->ldev->known_size = drbd_get_capacity(mdev->ldev->backing_bdev);
......@@ -2373,16 +2463,22 @@ int drbd_adm_resize(struct sk_buff *skb, struct genl_info *info)
}
ddsf = (rs.resize_force ? DDSF_FORCED : 0) | (rs.no_resync ? DDSF_NO_RESYNC : 0);
dd = drbd_determine_dev_size(mdev, ddsf);
dd = drbd_determine_dev_size(mdev, ddsf, change_al_layout ? &rs : NULL);
drbd_md_sync(mdev);
put_ldev(mdev);
if (dd == dev_size_error) {
if (dd == DS_ERROR) {
retcode = ERR_NOMEM_BITMAP;
goto fail;
} else if (dd == DS_ERROR_SPACE_MD) {
retcode = ERR_MD_LAYOUT_NO_FIT;
goto fail;
} else if (dd == DS_ERROR_SHRINK) {
retcode = ERR_IMPLICIT_SHRINK;
goto fail;
}
if (mdev->state.conn == C_CONNECTED) {
if (dd == grew)
if (dd == DS_GREW)
set_bit(RESIZE_PENDING, &mdev->flags);
drbd_send_uuids(mdev);
......@@ -2658,7 +2754,6 @@ int nla_put_status_info(struct sk_buff *skb, struct drbd_conf *mdev,
const struct sib_info *sib)
{
struct state_info *si = NULL; /* for sizeof(si->member); */
struct net_conf *nc;
struct nlattr *nla;
int got_ldev;
int err = 0;
......@@ -2688,13 +2783,19 @@ int nla_put_status_info(struct sk_buff *skb, struct drbd_conf *mdev,
goto nla_put_failure;
rcu_read_lock();
if (got_ldev)
if (disk_conf_to_skb(skb, rcu_dereference(mdev->ldev->disk_conf), exclude_sensitive))
goto nla_put_failure;
if (got_ldev) {
struct disk_conf *disk_conf;
nc = rcu_dereference(mdev->tconn->net_conf);
if (nc)
err = net_conf_to_skb(skb, nc, exclude_sensitive);
disk_conf = rcu_dereference(mdev->ldev->disk_conf);
err = disk_conf_to_skb(skb, disk_conf, exclude_sensitive);
}
if (!err) {
struct net_conf *nc;
nc = rcu_dereference(mdev->tconn->net_conf);
if (nc)
err = net_conf_to_skb(skb, nc, exclude_sensitive);
}
rcu_read_unlock();
if (err)
goto nla_put_failure;
......
......@@ -1039,6 +1039,8 @@ static int conn_connect(struct drbd_tconn *tconn)
rcu_read_lock();
idr_for_each_entry(&tconn->volumes, mdev, vnr) {
kref_get(&mdev->kref);
rcu_read_unlock();
/* Prevent a race between resync-handshake and
* being promoted to Primary.
*
......@@ -1049,8 +1051,6 @@ static int conn_connect(struct drbd_tconn *tconn)
mutex_lock(mdev->state_mutex);
mutex_unlock(mdev->state_mutex);
rcu_read_unlock();
if (discard_my_data)
set_bit(DISCARD_MY_DATA, &mdev->flags);
else
......@@ -3545,7 +3545,7 @@ static int receive_sizes(struct drbd_tconn *tconn, struct packet_info *pi)
{
struct drbd_conf *mdev;
struct p_sizes *p = pi->data;
enum determine_dev_size dd = unchanged;
enum determine_dev_size dd = DS_UNCHANGED;
sector_t p_size, p_usize, my_usize;
int ldsc = 0; /* local disk size changed */
enum dds_flags ddsf;
......@@ -3617,9 +3617,9 @@ static int receive_sizes(struct drbd_tconn *tconn, struct packet_info *pi)
ddsf = be16_to_cpu(p->dds_flags);
if (get_ldev(mdev)) {
dd = drbd_determine_dev_size(mdev, ddsf);
dd = drbd_determine_dev_size(mdev, ddsf, NULL);
put_ldev(mdev);
if (dd == dev_size_error)
if (dd == DS_ERROR)
return -EIO;
drbd_md_sync(mdev);
} else {
......@@ -3647,7 +3647,7 @@ static int receive_sizes(struct drbd_tconn *tconn, struct packet_info *pi)
drbd_send_sizes(mdev, 0, ddsf);
}
if (test_and_clear_bit(RESIZE_PENDING, &mdev->flags) ||
(dd == grew && mdev->state.conn == C_CONNECTED)) {
(dd == DS_GREW && mdev->state.conn == C_CONNECTED)) {
if (mdev->state.pdsk >= D_INCONSISTENT &&
mdev->state.disk >= D_INCONSISTENT) {
if (ddsf & DDSF_NO_RESYNC)
......
......@@ -1115,8 +1115,10 @@ __drbd_set_state(struct drbd_conf *mdev, union drbd_state ns,
drbd_thread_restart_nowait(&mdev->tconn->receiver);
/* Resume AL writing if we get a connection */
if (os.conn < C_CONNECTED && ns.conn >= C_CONNECTED)
if (os.conn < C_CONNECTED && ns.conn >= C_CONNECTED) {
drbd_resume_al(mdev);
mdev->tconn->connect_cnt++;
}
/* remember last attach time so request_timer_fn() won't
* kill newly established sessions while we are still trying to thaw
......
......@@ -31,6 +31,8 @@
#include <linux/slab.h>
#include <linux/bitops.h>
#include <linux/delay.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/genhd.h>
#include <linux/idr.h>
......@@ -39,8 +41,9 @@
#include "rsxx_cfg.h"
#define NO_LEGACY 0
#define SYNC_START_TIMEOUT (10 * 60) /* 10 minutes */
MODULE_DESCRIPTION("IBM FlashSystem 70/80 PCIe SSD Device Driver");
MODULE_DESCRIPTION("IBM Flash Adapter 900GB Full Height Device Driver");
MODULE_AUTHOR("Joshua Morris/Philip Kelleher, IBM");
MODULE_LICENSE("GPL");
MODULE_VERSION(DRIVER_VERSION);
......@@ -49,9 +52,282 @@ static unsigned int force_legacy = NO_LEGACY;
module_param(force_legacy, uint, 0444);
MODULE_PARM_DESC(force_legacy, "Force the use of legacy type PCI interrupts");
static unsigned int sync_start = 1;
module_param(sync_start, uint, 0444);
MODULE_PARM_DESC(sync_start, "On by Default: Driver load will not complete "
"until the card startup has completed.");
static DEFINE_IDA(rsxx_disk_ida);
static DEFINE_SPINLOCK(rsxx_ida_lock);
/* --------------------Debugfs Setup ------------------- */
struct rsxx_cram {
u32 f_pos;
u32 offset;
void *i_private;
};
static int rsxx_attr_pci_regs_show(struct seq_file *m, void *p)
{
struct rsxx_cardinfo *card = m->private;
seq_printf(m, "HWID 0x%08x\n",
ioread32(card->regmap + HWID));
seq_printf(m, "SCRATCH 0x%08x\n",
ioread32(card->regmap + SCRATCH));
seq_printf(m, "IER 0x%08x\n",
ioread32(card->regmap + IER));
seq_printf(m, "IPR 0x%08x\n",
ioread32(card->regmap + IPR));
seq_printf(m, "CREG_CMD 0x%08x\n",
ioread32(card->regmap + CREG_CMD));
seq_printf(m, "CREG_ADD 0x%08x\n",
ioread32(card->regmap + CREG_ADD));
seq_printf(m, "CREG_CNT 0x%08x\n",
ioread32(card->regmap + CREG_CNT));
seq_printf(m, "CREG_STAT 0x%08x\n",
ioread32(card->regmap + CREG_STAT));
seq_printf(m, "CREG_DATA0 0x%08x\n",
ioread32(card->regmap + CREG_DATA0));
seq_printf(m, "CREG_DATA1 0x%08x\n",
ioread32(card->regmap + CREG_DATA1));
seq_printf(m, "CREG_DATA2 0x%08x\n",
ioread32(card->regmap + CREG_DATA2));
seq_printf(m, "CREG_DATA3 0x%08x\n",
ioread32(card->regmap + CREG_DATA3));
seq_printf(m, "CREG_DATA4 0x%08x\n",
ioread32(card->regmap + CREG_DATA4));
seq_printf(m, "CREG_DATA5 0x%08x\n",
ioread32(card->regmap + CREG_DATA5));
seq_printf(m, "CREG_DATA6 0x%08x\n",
ioread32(card->regmap + CREG_DATA6));
seq_printf(m, "CREG_DATA7 0x%08x\n",
ioread32(card->regmap + CREG_DATA7));
seq_printf(m, "INTR_COAL 0x%08x\n",
ioread32(card->regmap + INTR_COAL));
seq_printf(m, "HW_ERROR 0x%08x\n",
ioread32(card->regmap + HW_ERROR));
seq_printf(m, "DEBUG0 0x%08x\n",
ioread32(card->regmap + PCI_DEBUG0));
seq_printf(m, "DEBUG1 0x%08x\n",
ioread32(card->regmap + PCI_DEBUG1));
seq_printf(m, "DEBUG2 0x%08x\n",
ioread32(card->regmap + PCI_DEBUG2));
seq_printf(m, "DEBUG3 0x%08x\n",
ioread32(card->regmap + PCI_DEBUG3));
seq_printf(m, "DEBUG4 0x%08x\n",
ioread32(card->regmap + PCI_DEBUG4));
seq_printf(m, "DEBUG5 0x%08x\n",
ioread32(card->regmap + PCI_DEBUG5));
seq_printf(m, "DEBUG6 0x%08x\n",
ioread32(card->regmap + PCI_DEBUG6));
seq_printf(m, "DEBUG7 0x%08x\n",
ioread32(card->regmap + PCI_DEBUG7));
seq_printf(m, "RECONFIG 0x%08x\n",
ioread32(card->regmap + PCI_RECONFIG));
return 0;
}
static int rsxx_attr_stats_show(struct seq_file *m, void *p)
{
struct rsxx_cardinfo *card = m->private;
int i;
for (i = 0; i < card->n_targets; i++) {
seq_printf(m, "Ctrl %d CRC Errors = %d\n",
i, card->ctrl[i].stats.crc_errors);
seq_printf(m, "Ctrl %d Hard Errors = %d\n",
i, card->ctrl[i].stats.hard_errors);
seq_printf(m, "Ctrl %d Soft Errors = %d\n",
i, card->ctrl[i].stats.soft_errors);
seq_printf(m, "Ctrl %d Writes Issued = %d\n",
i, card->ctrl[i].stats.writes_issued);
seq_printf(m, "Ctrl %d Writes Failed = %d\n",
i, card->ctrl[i].stats.writes_failed);
seq_printf(m, "Ctrl %d Reads Issued = %d\n",
i, card->ctrl[i].stats.reads_issued);
seq_printf(m, "Ctrl %d Reads Failed = %d\n",
i, card->ctrl[i].stats.reads_failed);
seq_printf(m, "Ctrl %d Reads Retried = %d\n",
i, card->ctrl[i].stats.reads_retried);
seq_printf(m, "Ctrl %d Discards Issued = %d\n",
i, card->ctrl[i].stats.discards_issued);
seq_printf(m, "Ctrl %d Discards Failed = %d\n",
i, card->ctrl[i].stats.discards_failed);
seq_printf(m, "Ctrl %d DMA SW Errors = %d\n",
i, card->ctrl[i].stats.dma_sw_err);
seq_printf(m, "Ctrl %d DMA HW Faults = %d\n",
i, card->ctrl[i].stats.dma_hw_fault);
seq_printf(m, "Ctrl %d DMAs Cancelled = %d\n",
i, card->ctrl[i].stats.dma_cancelled);
seq_printf(m, "Ctrl %d SW Queue Depth = %d\n",
i, card->ctrl[i].stats.sw_q_depth);
seq_printf(m, "Ctrl %d HW Queue Depth = %d\n",
i, atomic_read(&card->ctrl[i].stats.hw_q_depth));
}
return 0;
}
static int rsxx_attr_stats_open(struct inode *inode, struct file *file)
{
return single_open(file, rsxx_attr_stats_show, inode->i_private);
}
static int rsxx_attr_pci_regs_open(struct inode *inode, struct file *file)
{
return single_open(file, rsxx_attr_pci_regs_show, inode->i_private);
}
static ssize_t rsxx_cram_read(struct file *fp, char __user *ubuf,
size_t cnt, loff_t *ppos)
{
struct rsxx_cram *info = fp->private_data;
struct rsxx_cardinfo *card = info->i_private;
char *buf;
int st;
buf = kzalloc(sizeof(*buf) * cnt, GFP_KERNEL);
if (!buf)
return -ENOMEM;
info->f_pos = (u32)*ppos + info->offset;
st = rsxx_creg_read(card, CREG_ADD_CRAM + info->f_pos, cnt, buf, 1);
if (st)
return st;
st = copy_to_user(ubuf, buf, cnt);
if (st)
return st;
info->offset += cnt;
kfree(buf);
return cnt;
}
static ssize_t rsxx_cram_write(struct file *fp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
{
struct rsxx_cram *info = fp->private_data;
struct rsxx_cardinfo *card = info->i_private;
char *buf;
int st;
buf = kzalloc(sizeof(*buf) * cnt, GFP_KERNEL);
if (!buf)
return -ENOMEM;
st = copy_from_user(buf, ubuf, cnt);
if (st)
return st;
info->f_pos = (u32)*ppos + info->offset;
st = rsxx_creg_write(card, CREG_ADD_CRAM + info->f_pos, cnt, buf, 1);
if (st)
return st;
info->offset += cnt;
kfree(buf);
return cnt;
}
static int rsxx_cram_open(struct inode *inode, struct file *file)
{
struct rsxx_cram *info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
return -ENOMEM;
info->i_private = inode->i_private;
info->f_pos = file->f_pos;
file->private_data = info;
return 0;
}
static int rsxx_cram_release(struct inode *inode, struct file *file)
{
struct rsxx_cram *info = file->private_data;
if (!info)
return 0;
kfree(info);
file->private_data = NULL;
return 0;
}
static const struct file_operations debugfs_cram_fops = {
.owner = THIS_MODULE,
.open = rsxx_cram_open,
.read = rsxx_cram_read,
.write = rsxx_cram_write,
.release = rsxx_cram_release,
};
static const struct file_operations debugfs_stats_fops = {
.owner = THIS_MODULE,
.open = rsxx_attr_stats_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
static const struct file_operations debugfs_pci_regs_fops = {
.owner = THIS_MODULE,
.open = rsxx_attr_pci_regs_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
static void rsxx_debugfs_dev_new(struct rsxx_cardinfo *card)
{
struct dentry *debugfs_stats;
struct dentry *debugfs_pci_regs;
struct dentry *debugfs_cram;
card->debugfs_dir = debugfs_create_dir(card->gendisk->disk_name, NULL);
if (IS_ERR_OR_NULL(card->debugfs_dir))
goto failed_debugfs_dir;
debugfs_stats = debugfs_create_file("stats", S_IRUGO,
card->debugfs_dir, card,
&debugfs_stats_fops);
if (IS_ERR_OR_NULL(debugfs_stats))
goto failed_debugfs_stats;
debugfs_pci_regs = debugfs_create_file("pci_regs", S_IRUGO,
card->debugfs_dir, card,
&debugfs_pci_regs_fops);
if (IS_ERR_OR_NULL(debugfs_pci_regs))
goto failed_debugfs_pci_regs;
debugfs_cram = debugfs_create_file("cram", S_IRUGO | S_IWUSR,
card->debugfs_dir, card,
&debugfs_cram_fops);
if (IS_ERR_OR_NULL(debugfs_cram))
goto failed_debugfs_cram;
return;
failed_debugfs_cram:
debugfs_remove(debugfs_pci_regs);
failed_debugfs_pci_regs:
debugfs_remove(debugfs_stats);
failed_debugfs_stats:
debugfs_remove(card->debugfs_dir);
failed_debugfs_dir:
card->debugfs_dir = NULL;
}
/*----------------- Interrupt Control & Handling -------------------*/
static void rsxx_mask_interrupts(struct rsxx_cardinfo *card)
......@@ -163,12 +439,13 @@ static irqreturn_t rsxx_isr(int irq, void *pdata)
}
if (isr & CR_INTR_CREG) {
schedule_work(&card->creg_ctrl.done_work);
queue_work(card->creg_ctrl.creg_wq,
&card->creg_ctrl.done_work);
handled++;
}
if (isr & CR_INTR_EVENT) {
schedule_work(&card->event_work);
queue_work(card->event_wq, &card->event_work);
rsxx_disable_ier_and_isr(card, CR_INTR_EVENT);
handled++;
}
......@@ -329,7 +606,7 @@ static int rsxx_eeh_frozen(struct pci_dev *dev)
int i;
int st;
dev_warn(&dev->dev, "IBM FlashSystem PCI: preparing for slot reset.\n");
dev_warn(&dev->dev, "IBM Flash Adapter PCI: preparing for slot reset.\n");
card->eeh_state = 1;
rsxx_mask_interrupts(card);
......@@ -367,15 +644,26 @@ static void rsxx_eeh_failure(struct pci_dev *dev)
{
struct rsxx_cardinfo *card = pci_get_drvdata(dev);
int i;
int cnt = 0;
dev_err(&dev->dev, "IBM FlashSystem PCI: disabling failed card.\n");
dev_err(&dev->dev, "IBM Flash Adapter PCI: disabling failed card.\n");
card->eeh_state = 1;
card->halt = 1;
for (i = 0; i < card->n_targets; i++)
del_timer_sync(&card->ctrl[i].activity_timer);
for (i = 0; i < card->n_targets; i++) {
spin_lock_bh(&card->ctrl[i].queue_lock);
cnt = rsxx_cleanup_dma_queue(&card->ctrl[i],
&card->ctrl[i].queue);
spin_unlock_bh(&card->ctrl[i].queue_lock);
cnt += rsxx_dma_cancel(&card->ctrl[i]);
rsxx_eeh_cancel_dmas(card);
if (cnt)
dev_info(CARD_TO_DEV(card),
"Freed %d queued DMAs on channel %d\n",
cnt, card->ctrl[i].id);
}
}
static int rsxx_eeh_fifo_flush_poll(struct rsxx_cardinfo *card)
......@@ -432,7 +720,7 @@ static pci_ers_result_t rsxx_slot_reset(struct pci_dev *dev)
int st;
dev_warn(&dev->dev,
"IBM FlashSystem PCI: recovering from slot reset.\n");
"IBM Flash Adapter PCI: recovering from slot reset.\n");
st = pci_enable_device(dev);
if (st)
......@@ -485,7 +773,7 @@ static pci_ers_result_t rsxx_slot_reset(struct pci_dev *dev)
&card->ctrl[i].issue_dma_work);
}
dev_info(&dev->dev, "IBM FlashSystem PCI: recovery complete.\n");
dev_info(&dev->dev, "IBM Flash Adapter PCI: recovery complete.\n");
return PCI_ERS_RESULT_RECOVERED;
......@@ -528,6 +816,7 @@ static int rsxx_pci_probe(struct pci_dev *dev,
{
struct rsxx_cardinfo *card;
int st;
unsigned int sync_timeout;
dev_info(&dev->dev, "PCI-Flash SSD discovered\n");
......@@ -610,7 +899,11 @@ static int rsxx_pci_probe(struct pci_dev *dev,
}
/************* Setup Processor Command Interface *************/
rsxx_creg_setup(card);
st = rsxx_creg_setup(card);
if (st) {
dev_err(CARD_TO_DEV(card), "Failed to setup creg interface.\n");
goto failed_creg_setup;
}
spin_lock_irq(&card->irq_lock);
rsxx_enable_ier_and_isr(card, CR_INTR_CREG);
......@@ -650,6 +943,12 @@ static int rsxx_pci_probe(struct pci_dev *dev,
}
/************* Setup Card Event Handler *************/
card->event_wq = create_singlethread_workqueue(DRIVER_NAME"_event");
if (!card->event_wq) {
dev_err(CARD_TO_DEV(card), "Failed card event setup.\n");
goto failed_event_handler;
}
INIT_WORK(&card->event_work, card_event_handler);
st = rsxx_setup_dev(card);
......@@ -676,6 +975,33 @@ static int rsxx_pci_probe(struct pci_dev *dev,
if (st)
dev_crit(CARD_TO_DEV(card),
"Failed issuing card startup\n");
if (sync_start) {
sync_timeout = SYNC_START_TIMEOUT;
dev_info(CARD_TO_DEV(card),
"Waiting for card to startup\n");
do {
ssleep(1);
sync_timeout--;
rsxx_get_card_state(card, &card->state);
} while (sync_timeout &&
(card->state == CARD_STATE_STARTING));
if (card->state == CARD_STATE_STARTING) {
dev_warn(CARD_TO_DEV(card),
"Card startup timed out\n");
card->size8 = 0;
} else {
dev_info(CARD_TO_DEV(card),
"card state: %s\n",
rsxx_card_state_to_str(card->state));
st = rsxx_get_card_size8(card, &card->size8);
if (st)
card->size8 = 0;
}
}
} else if (card->state == CARD_STATE_GOOD ||
card->state == CARD_STATE_RD_ONLY_FAULT) {
st = rsxx_get_card_size8(card, &card->size8);
......@@ -685,12 +1011,21 @@ static int rsxx_pci_probe(struct pci_dev *dev,
rsxx_attach_dev(card);
/************* Setup Debugfs *************/
rsxx_debugfs_dev_new(card);
return 0;
failed_create_dev:
destroy_workqueue(card->event_wq);
card->event_wq = NULL;
failed_event_handler:
rsxx_dma_destroy(card);
failed_dma_setup:
failed_compatiblity_check:
destroy_workqueue(card->creg_ctrl.creg_wq);
card->creg_ctrl.creg_wq = NULL;
failed_creg_setup:
spin_lock_irq(&card->irq_lock);
rsxx_disable_ier_and_isr(card, CR_INTR_ALL);
spin_unlock_irq(&card->irq_lock);
......@@ -756,6 +1091,8 @@ static void rsxx_pci_remove(struct pci_dev *dev)
/* Prevent work_structs from re-queuing themselves. */
card->halt = 1;
debugfs_remove_recursive(card->debugfs_dir);
free_irq(dev->irq, card);
if (!force_legacy)
......
......@@ -431,6 +431,15 @@ static int __issue_creg_rw(struct rsxx_cardinfo *card,
*hw_stat = completion.creg_status;
if (completion.st) {
/*
* This read is needed to verify that there has not been any
* extreme errors that might have occurred, i.e. EEH. The
* function iowrite32 will not detect EEH errors, so it is
* necessary that we recover if such an error is the reason
* for the timeout. This is a dummy read.
*/
ioread32(card->regmap + SCRATCH);
dev_warn(CARD_TO_DEV(card),
"creg command failed(%d x%08x)\n",
completion.st, addr);
......@@ -727,6 +736,11 @@ int rsxx_creg_setup(struct rsxx_cardinfo *card)
{
card->creg_ctrl.active_cmd = NULL;
card->creg_ctrl.creg_wq =
create_singlethread_workqueue(DRIVER_NAME"_creg");
if (!card->creg_ctrl.creg_wq)
return -ENOMEM;
INIT_WORK(&card->creg_ctrl.done_work, creg_cmd_done);
mutex_init(&card->creg_ctrl.reset_lock);
INIT_LIST_HEAD(&card->creg_ctrl.queue);
......
......@@ -155,7 +155,8 @@ static void bio_dma_done_cb(struct rsxx_cardinfo *card,
atomic_set(&meta->error, 1);
if (atomic_dec_and_test(&meta->pending_dmas)) {
disk_stats_complete(card, meta->bio, meta->start_time);
if (!card->eeh_state && card->gendisk)
disk_stats_complete(card, meta->bio, meta->start_time);
bio_endio(meta->bio, atomic_read(&meta->error) ? -EIO : 0);
kmem_cache_free(bio_meta_pool, meta);
......@@ -170,6 +171,12 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
might_sleep();
if (!card)
goto req_err;
if (bio->bi_sector + (bio->bi_size >> 9) > get_capacity(card->gendisk))
goto req_err;
if (unlikely(card->halt)) {
st = -EFAULT;
goto req_err;
......@@ -196,7 +203,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
atomic_set(&bio_meta->pending_dmas, 0);
bio_meta->start_time = jiffies;
disk_stats_start(card, bio);
if (!unlikely(card->halt))
disk_stats_start(card, bio);
dev_dbg(CARD_TO_DEV(card), "BIO[%c]: meta: %p addr8: x%llx size: %d\n",
bio_data_dir(bio) ? 'W' : 'R', bio_meta,
......@@ -225,24 +233,6 @@ static bool rsxx_discard_supported(struct rsxx_cardinfo *card)
return (pci_rev >= RSXX_DISCARD_SUPPORT);
}
static unsigned short rsxx_get_logical_block_size(
struct rsxx_cardinfo *card)
{
u32 capabilities = 0;
int st;
st = rsxx_get_card_capabilities(card, &capabilities);
if (st)
dev_warn(CARD_TO_DEV(card),
"Failed reading card capabilities register\n");
/* Earlier firmware did not have support for 512 byte accesses */
if (capabilities & CARD_CAP_SUBPAGE_WRITES)
return 512;
else
return RSXX_HW_BLK_SIZE;
}
int rsxx_attach_dev(struct rsxx_cardinfo *card)
{
mutex_lock(&card->dev_lock);
......@@ -305,7 +295,7 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
return -ENOMEM;
}
blk_size = rsxx_get_logical_block_size(card);
blk_size = card->config.data.block_size;
blk_queue_make_request(card->queue, rsxx_make_request);
blk_queue_bounce_limit(card->queue, BLK_BOUNCE_ANY);
......@@ -347,6 +337,7 @@ void rsxx_destroy_dev(struct rsxx_cardinfo *card)
card->gendisk = NULL;
blk_cleanup_queue(card->queue);
card->queue->queuedata = NULL;
unregister_blkdev(card->major, DRIVER_NAME);
}
......
......@@ -245,6 +245,22 @@ static void rsxx_complete_dma(struct rsxx_dma_ctrl *ctrl,
kmem_cache_free(rsxx_dma_pool, dma);
}
int rsxx_cleanup_dma_queue(struct rsxx_dma_ctrl *ctrl,
struct list_head *q)
{
struct rsxx_dma *dma;
struct rsxx_dma *tmp;
int cnt = 0;
list_for_each_entry_safe(dma, tmp, q, list) {
list_del(&dma->list);
rsxx_complete_dma(ctrl, dma, DMA_CANCELLED);
cnt++;
}
return cnt;
}
static void rsxx_requeue_dma(struct rsxx_dma_ctrl *ctrl,
struct rsxx_dma *dma)
{
......@@ -252,9 +268,10 @@ static void rsxx_requeue_dma(struct rsxx_dma_ctrl *ctrl,
* Requeued DMAs go to the front of the queue so they are issued
* first.
*/
spin_lock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
ctrl->stats.sw_q_depth++;
list_add(&dma->list, &ctrl->queue);
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
}
static void rsxx_handle_dma_error(struct rsxx_dma_ctrl *ctrl,
......@@ -329,6 +346,7 @@ static void rsxx_handle_dma_error(struct rsxx_dma_ctrl *ctrl,
static void dma_engine_stalled(unsigned long data)
{
struct rsxx_dma_ctrl *ctrl = (struct rsxx_dma_ctrl *)data;
int cnt;
if (atomic_read(&ctrl->stats.hw_q_depth) == 0 ||
unlikely(ctrl->card->eeh_state))
......@@ -349,18 +367,28 @@ static void dma_engine_stalled(unsigned long data)
"DMA channel %d has stalled, faulting interface.\n",
ctrl->id);
ctrl->card->dma_fault = 1;
/* Clean up the DMA queue */
spin_lock(&ctrl->queue_lock);
cnt = rsxx_cleanup_dma_queue(ctrl, &ctrl->queue);
spin_unlock(&ctrl->queue_lock);
cnt += rsxx_dma_cancel(ctrl);
if (cnt)
dev_info(CARD_TO_DEV(ctrl->card),
"Freed %d queued DMAs on channel %d\n",
cnt, ctrl->id);
}
}
static void rsxx_issue_dmas(struct work_struct *work)
static void rsxx_issue_dmas(struct rsxx_dma_ctrl *ctrl)
{
struct rsxx_dma_ctrl *ctrl;
struct rsxx_dma *dma;
int tag;
int cmds_pending = 0;
struct hw_cmd *hw_cmd_buf;
ctrl = container_of(work, struct rsxx_dma_ctrl, issue_dma_work);
hw_cmd_buf = ctrl->cmd.buf;
if (unlikely(ctrl->card->halt) ||
......@@ -368,22 +396,22 @@ static void rsxx_issue_dmas(struct work_struct *work)
return;
while (1) {
spin_lock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
if (list_empty(&ctrl->queue)) {
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
break;
}
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
tag = pop_tracker(ctrl->trackers);
if (tag == -1)
break;
spin_lock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
dma = list_entry(ctrl->queue.next, struct rsxx_dma, list);
list_del(&dma->list);
ctrl->stats.sw_q_depth--;
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
/*
* This will catch any DMAs that slipped in right before the
......@@ -440,9 +468,8 @@ static void rsxx_issue_dmas(struct work_struct *work)
}
}
static void rsxx_dma_done(struct work_struct *work)
static void rsxx_dma_done(struct rsxx_dma_ctrl *ctrl)
{
struct rsxx_dma_ctrl *ctrl;
struct rsxx_dma *dma;
unsigned long flags;
u16 count;
......@@ -450,7 +477,6 @@ static void rsxx_dma_done(struct work_struct *work)
u8 tag;
struct hw_status *hw_st_buf;
ctrl = container_of(work, struct rsxx_dma_ctrl, dma_done_work);
hw_st_buf = ctrl->status.buf;
if (unlikely(ctrl->card->halt) ||
......@@ -520,33 +546,32 @@ static void rsxx_dma_done(struct work_struct *work)
rsxx_enable_ier(ctrl->card, CR_INTR_DMA(ctrl->id));
spin_unlock_irqrestore(&ctrl->card->irq_lock, flags);
spin_lock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
if (ctrl->stats.sw_q_depth)
queue_work(ctrl->issue_wq, &ctrl->issue_dma_work);
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
}
static int rsxx_cleanup_dma_queue(struct rsxx_cardinfo *card,
struct list_head *q)
static void rsxx_schedule_issue(struct work_struct *work)
{
struct rsxx_dma *dma;
struct rsxx_dma *tmp;
int cnt = 0;
struct rsxx_dma_ctrl *ctrl;
list_for_each_entry_safe(dma, tmp, q, list) {
list_del(&dma->list);
ctrl = container_of(work, struct rsxx_dma_ctrl, issue_dma_work);
if (dma->dma_addr)
pci_unmap_page(card->dev, dma->dma_addr,
get_dma_size(dma),
(dma->cmd == HW_CMD_BLK_WRITE) ?
PCI_DMA_TODEVICE :
PCI_DMA_FROMDEVICE);
kmem_cache_free(rsxx_dma_pool, dma);
cnt++;
}
mutex_lock(&ctrl->work_lock);
rsxx_issue_dmas(ctrl);
mutex_unlock(&ctrl->work_lock);
}
return cnt;
static void rsxx_schedule_done(struct work_struct *work)
{
struct rsxx_dma_ctrl *ctrl;
ctrl = container_of(work, struct rsxx_dma_ctrl, dma_done_work);
mutex_lock(&ctrl->work_lock);
rsxx_dma_done(ctrl);
mutex_unlock(&ctrl->work_lock);
}
static int rsxx_queue_discard(struct rsxx_cardinfo *card,
......@@ -698,10 +723,10 @@ int rsxx_dma_queue_bio(struct rsxx_cardinfo *card,
for (i = 0; i < card->n_targets; i++) {
if (!list_empty(&dma_list[i])) {
spin_lock(&card->ctrl[i].queue_lock);
spin_lock_bh(&card->ctrl[i].queue_lock);
card->ctrl[i].stats.sw_q_depth += dma_cnt[i];
list_splice_tail(&dma_list[i], &card->ctrl[i].queue);
spin_unlock(&card->ctrl[i].queue_lock);
spin_unlock_bh(&card->ctrl[i].queue_lock);
queue_work(card->ctrl[i].issue_wq,
&card->ctrl[i].issue_dma_work);
......@@ -711,8 +736,11 @@ int rsxx_dma_queue_bio(struct rsxx_cardinfo *card,
return 0;
bvec_err:
for (i = 0; i < card->n_targets; i++)
rsxx_cleanup_dma_queue(card, &dma_list[i]);
for (i = 0; i < card->n_targets; i++) {
spin_lock_bh(&card->ctrl[i].queue_lock);
rsxx_cleanup_dma_queue(&card->ctrl[i], &dma_list[i]);
spin_unlock_bh(&card->ctrl[i].queue_lock);
}
return st;
}
......@@ -780,6 +808,7 @@ static int rsxx_dma_ctrl_init(struct pci_dev *dev,
spin_lock_init(&ctrl->trackers->lock);
spin_lock_init(&ctrl->queue_lock);
mutex_init(&ctrl->work_lock);
INIT_LIST_HEAD(&ctrl->queue);
setup_timer(&ctrl->activity_timer, dma_engine_stalled,
......@@ -793,8 +822,8 @@ static int rsxx_dma_ctrl_init(struct pci_dev *dev,
if (!ctrl->done_wq)
return -ENOMEM;
INIT_WORK(&ctrl->issue_dma_work, rsxx_issue_dmas);
INIT_WORK(&ctrl->dma_done_work, rsxx_dma_done);
INIT_WORK(&ctrl->issue_dma_work, rsxx_schedule_issue);
INIT_WORK(&ctrl->dma_done_work, rsxx_schedule_done);
st = rsxx_hw_buffers_init(dev, ctrl);
if (st)
......@@ -918,13 +947,30 @@ int rsxx_dma_setup(struct rsxx_cardinfo *card)
return st;
}
int rsxx_dma_cancel(struct rsxx_dma_ctrl *ctrl)
{
struct rsxx_dma *dma;
int i;
int cnt = 0;
/* Clean up issued DMAs */
for (i = 0; i < RSXX_MAX_OUTSTANDING_CMDS; i++) {
dma = get_tracker_dma(ctrl->trackers, i);
if (dma) {
atomic_dec(&ctrl->stats.hw_q_depth);
rsxx_complete_dma(ctrl, dma, DMA_CANCELLED);
push_tracker(ctrl->trackers, i);
cnt++;
}
}
return cnt;
}
void rsxx_dma_destroy(struct rsxx_cardinfo *card)
{
struct rsxx_dma_ctrl *ctrl;
struct rsxx_dma *dma;
int i, j;
int cnt = 0;
int i;
for (i = 0; i < card->n_targets; i++) {
ctrl = &card->ctrl[i];
......@@ -943,33 +989,11 @@ void rsxx_dma_destroy(struct rsxx_cardinfo *card)
del_timer_sync(&ctrl->activity_timer);
/* Clean up the DMA queue */
spin_lock(&ctrl->queue_lock);
cnt = rsxx_cleanup_dma_queue(card, &ctrl->queue);
spin_unlock(&ctrl->queue_lock);
if (cnt)
dev_info(CARD_TO_DEV(card),
"Freed %d queued DMAs on channel %d\n",
cnt, i);
/* Clean up issued DMAs */
for (j = 0; j < RSXX_MAX_OUTSTANDING_CMDS; j++) {
dma = get_tracker_dma(ctrl->trackers, j);
if (dma) {
pci_unmap_page(card->dev, dma->dma_addr,
get_dma_size(dma),
(dma->cmd == HW_CMD_BLK_WRITE) ?
PCI_DMA_TODEVICE :
PCI_DMA_FROMDEVICE);
kmem_cache_free(rsxx_dma_pool, dma);
cnt++;
}
}
spin_lock_bh(&ctrl->queue_lock);
rsxx_cleanup_dma_queue(ctrl, &ctrl->queue);
spin_unlock_bh(&ctrl->queue_lock);
if (cnt)
dev_info(CARD_TO_DEV(card),
"Freed %d pending DMAs on channel %d\n",
cnt, i);
rsxx_dma_cancel(ctrl);
vfree(ctrl->trackers);
......@@ -1013,7 +1037,7 @@ int rsxx_eeh_save_issued_dmas(struct rsxx_cardinfo *card)
cnt++;
}
spin_lock(&card->ctrl[i].queue_lock);
spin_lock_bh(&card->ctrl[i].queue_lock);
list_splice(&issued_dmas[i], &card->ctrl[i].queue);
atomic_sub(cnt, &card->ctrl[i].stats.hw_q_depth);
......@@ -1028,7 +1052,7 @@ int rsxx_eeh_save_issued_dmas(struct rsxx_cardinfo *card)
PCI_DMA_TODEVICE :
PCI_DMA_FROMDEVICE);
}
spin_unlock(&card->ctrl[i].queue_lock);
spin_unlock_bh(&card->ctrl[i].queue_lock);
}
kfree(issued_dmas);
......@@ -1036,30 +1060,13 @@ int rsxx_eeh_save_issued_dmas(struct rsxx_cardinfo *card)
return 0;
}
void rsxx_eeh_cancel_dmas(struct rsxx_cardinfo *card)
{
struct rsxx_dma *dma;
struct rsxx_dma *tmp;
int i;
for (i = 0; i < card->n_targets; i++) {
spin_lock(&card->ctrl[i].queue_lock);
list_for_each_entry_safe(dma, tmp, &card->ctrl[i].queue, list) {
list_del(&dma->list);
rsxx_complete_dma(&card->ctrl[i], dma, DMA_CANCELLED);
}
spin_unlock(&card->ctrl[i].queue_lock);
}
}
int rsxx_eeh_remap_dmas(struct rsxx_cardinfo *card)
{
struct rsxx_dma *dma;
int i;
for (i = 0; i < card->n_targets; i++) {
spin_lock(&card->ctrl[i].queue_lock);
spin_lock_bh(&card->ctrl[i].queue_lock);
list_for_each_entry(dma, &card->ctrl[i].queue, list) {
dma->dma_addr = pci_map_page(card->dev, dma->page,
dma->pg_off, get_dma_size(dma),
......@@ -1067,12 +1074,12 @@ int rsxx_eeh_remap_dmas(struct rsxx_cardinfo *card)
PCI_DMA_TODEVICE :
PCI_DMA_FROMDEVICE);
if (!dma->dma_addr) {
spin_unlock(&card->ctrl[i].queue_lock);
spin_unlock_bh(&card->ctrl[i].queue_lock);
kmem_cache_free(rsxx_dma_pool, dma);
return -ENOMEM;
}
}
spin_unlock(&card->ctrl[i].queue_lock);
spin_unlock_bh(&card->ctrl[i].queue_lock);
}
return 0;
......
......@@ -39,6 +39,7 @@
#include <linux/vmalloc.h>
#include <linux/timer.h>
#include <linux/ioctl.h>
#include <linux/delay.h>
#include "rsxx.h"
#include "rsxx_cfg.h"
......@@ -114,6 +115,7 @@ struct rsxx_dma_ctrl {
struct timer_list activity_timer;
struct dma_tracker_list *trackers;
struct rsxx_dma_stats stats;
struct mutex work_lock;
};
struct rsxx_cardinfo {
......@@ -134,6 +136,7 @@ struct rsxx_cardinfo {
spinlock_t lock;
bool active;
struct creg_cmd *active_cmd;
struct workqueue_struct *creg_wq;
struct work_struct done_work;
struct list_head queue;
unsigned int q_depth;
......@@ -154,6 +157,7 @@ struct rsxx_cardinfo {
int buf_len;
} log;
struct workqueue_struct *event_wq;
struct work_struct event_work;
unsigned int state;
u64 size8;
......@@ -181,6 +185,8 @@ struct rsxx_cardinfo {
int n_targets;
struct rsxx_dma_ctrl *ctrl;
struct dentry *debugfs_dir;
};
enum rsxx_pci_regmap {
......@@ -283,6 +289,7 @@ enum rsxx_creg_addr {
CREG_ADD_CAPABILITIES = 0x80001050,
CREG_ADD_LOG = 0x80002000,
CREG_ADD_NUM_TARGETS = 0x80003000,
CREG_ADD_CRAM = 0xA0000000,
CREG_ADD_CONFIG = 0xB0000000,
};
......@@ -372,6 +379,8 @@ typedef void (*rsxx_dma_cb)(struct rsxx_cardinfo *card,
int rsxx_dma_setup(struct rsxx_cardinfo *card);
void rsxx_dma_destroy(struct rsxx_cardinfo *card);
int rsxx_dma_init(void);
int rsxx_cleanup_dma_queue(struct rsxx_dma_ctrl *ctrl, struct list_head *q);
int rsxx_dma_cancel(struct rsxx_dma_ctrl *ctrl);
void rsxx_dma_cleanup(void);
void rsxx_dma_queue_reset(struct rsxx_cardinfo *card);
int rsxx_dma_configure(struct rsxx_cardinfo *card);
......@@ -382,7 +391,6 @@ int rsxx_dma_queue_bio(struct rsxx_cardinfo *card,
void *cb_data);
int rsxx_hw_buffers_init(struct pci_dev *dev, struct rsxx_dma_ctrl *ctrl);
int rsxx_eeh_save_issued_dmas(struct rsxx_cardinfo *card);
void rsxx_eeh_cancel_dmas(struct rsxx_cardinfo *card);
int rsxx_eeh_remap_dmas(struct rsxx_cardinfo *card);
/***** cregs.c *****/
......
此差异已折叠。
......@@ -50,6 +50,19 @@
__func__, __LINE__, ##args)
/*
* This is the maximum number of segments that would be allowed in indirect
* requests. This value will also be passed to the frontend.
*/
#define MAX_INDIRECT_SEGMENTS 256
#define SEGS_PER_INDIRECT_FRAME \
(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
#define MAX_INDIRECT_PAGES \
((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
#define INDIRECT_PAGES(_segs) \
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
/* Not a real protocol. Used to generate ring structs which contain
* the elements common to all protocols only. This way we get a
* compiler-checkable way to use common struct elements, so we can
......@@ -83,12 +96,31 @@ struct blkif_x86_32_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));
struct blkif_x86_32_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad1;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
/*
* The maximum number of indirect segments (and pages) that will
* be used is determined by MAX_INDIRECT_SEGMENTS, this value
* is also exported to the guest (via xenstore
* feature-max-indirect-segments entry), so the frontend knows how
* many indirect segments the backend supports.
*/
uint64_t _pad2; /* make it 64 byte aligned */
} __attribute__((__packed__));
struct blkif_x86_32_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_x86_32_request_rw rw;
struct blkif_x86_32_request_discard discard;
struct blkif_x86_32_request_other other;
struct blkif_x86_32_request_indirect indirect;
} u;
} __attribute__((__packed__));
......@@ -127,12 +159,32 @@ struct blkif_x86_64_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));
struct blkif_x86_64_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
uint32_t _pad1; /* offsetof(blkif_..,u.indirect.id)==8 */
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad2;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
/*
* The maximum number of indirect segments (and pages) that will
* be used is determined by MAX_INDIRECT_SEGMENTS, this value
* is also exported to the guest (via xenstore
* feature-max-indirect-segments entry), so the frontend knows how
* many indirect segments the backend supports.
*/
uint32_t _pad3; /* make it 64 byte aligned */
} __attribute__((__packed__));
struct blkif_x86_64_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_x86_64_request_rw rw;
struct blkif_x86_64_request_discard discard;
struct blkif_x86_64_request_other other;
struct blkif_x86_64_request_indirect indirect;
} u;
} __attribute__((__packed__));
......@@ -182,12 +234,26 @@ struct xen_vbd {
struct backend_info;
/* Number of available flags */
#define PERSISTENT_GNT_FLAGS_SIZE 2
/* This persistent grant is currently in use */
#define PERSISTENT_GNT_ACTIVE 0
/*
* This persistent grant has been used, this flag is set when we remove the
* PERSISTENT_GNT_ACTIVE, to know that this grant has been used recently.
*/
#define PERSISTENT_GNT_WAS_ACTIVE 1
/* Number of requests that we can fit in a ring */
#define XEN_BLKIF_REQS 32
struct persistent_gnt {
struct page *page;
grant_ref_t gnt;
grant_handle_t handle;
DECLARE_BITMAP(flags, PERSISTENT_GNT_FLAGS_SIZE);
struct rb_node node;
struct list_head remove_node;
};
struct xen_blkif {
......@@ -219,6 +285,23 @@ struct xen_blkif {
/* tree to store persistent grants */
struct rb_root persistent_gnts;
unsigned int persistent_gnt_c;
atomic_t persistent_gnt_in_use;
unsigned long next_lru;
/* used by the kworker that offload work from the persistent purge */
struct list_head persistent_purge_list;
struct work_struct persistent_purge_work;
/* buffer of free pages to map grant refs */
spinlock_t free_pages_lock;
int free_pages_num;
struct list_head free_pages;
/* List of all 'pending_req' available */
struct list_head pending_free;
/* And its spinlock. */
spinlock_t pending_free_lock;
wait_queue_head_t pending_free_wq;
/* statistics */
unsigned long st_print;
......@@ -231,6 +314,41 @@ struct xen_blkif {
unsigned long long st_wr_sect;
wait_queue_head_t waiting_to_free;
/* Thread shutdown wait queue. */
wait_queue_head_t shutdown_wq;
};
struct seg_buf {
unsigned long offset;
unsigned int nsec;
};
struct grant_page {
struct page *page;
struct persistent_gnt *persistent_gnt;
grant_handle_t handle;
grant_ref_t gref;
};
/*
* Each outstanding request that we've passed to the lower device layers has a
* 'pending_req' allocated to it. Each buffer_head that completes decrements
* the pendcnt towards zero. When it hits zero, the specified domain has a
* response queued for it, with the saved 'id' passed back.
*/
struct pending_req {
struct xen_blkif *blkif;
u64 id;
int nr_pages;
atomic_t pendcnt;
unsigned short operation;
int status;
struct list_head free_list;
struct grant_page *segments[MAX_INDIRECT_SEGMENTS];
/* Indirect descriptors */
struct grant_page *indirect_pages[MAX_INDIRECT_PAGES];
struct seg_buf seg[MAX_INDIRECT_SEGMENTS];
struct bio *biolist[MAX_INDIRECT_SEGMENTS];
};
......@@ -257,6 +375,7 @@ int xen_blkif_xenbus_init(void);
irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
int xen_blkif_schedule(void *arg);
int xen_blkif_purge_persistent(void *arg);
int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
struct backend_info *be, int state);
......@@ -268,7 +387,7 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be);
static inline void blkif_get_x86_32_req(struct blkif_request *dst,
struct blkif_x86_32_request *src)
{
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
......@@ -291,6 +410,18 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
case BLKIF_OP_INDIRECT:
dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
dst->u.indirect.handle = src->u.indirect.handle;
dst->u.indirect.id = src->u.indirect.id;
dst->u.indirect.sector_number = src->u.indirect.sector_number;
barrier();
j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
for (i = 0; i < j; i++)
dst->u.indirect.indirect_grefs[i] =
src->u.indirect.indirect_grefs[i];
break;
default:
/*
* Don't know how to translate this op. Only get the
......@@ -304,7 +435,7 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
static inline void blkif_get_x86_64_req(struct blkif_request *dst,
struct blkif_x86_64_request *src)
{
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
......@@ -327,6 +458,18 @@ static inline void blkif_get_x86_64_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
case BLKIF_OP_INDIRECT:
dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
dst->u.indirect.handle = src->u.indirect.handle;
dst->u.indirect.id = src->u.indirect.id;
dst->u.indirect.sector_number = src->u.indirect.sector_number;
barrier();
j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
for (i = 0; i < j; i++)
dst->u.indirect.indirect_grefs[i] =
src->u.indirect.indirect_grefs[i];
break;
default:
/*
* Don't know how to translate this op. Only get the
......
......@@ -98,12 +98,17 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
err = PTR_ERR(blkif->xenblkd);
blkif->xenblkd = NULL;
xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
return;
}
}
static struct xen_blkif *xen_blkif_alloc(domid_t domid)
{
struct xen_blkif *blkif;
struct pending_req *req, *n;
int i, j;
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
if (!blkif)
......@@ -118,8 +123,57 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
init_waitqueue_head(&blkif->waiting_to_free);
blkif->persistent_gnts.rb_node = NULL;
spin_lock_init(&blkif->free_pages_lock);
INIT_LIST_HEAD(&blkif->free_pages);
blkif->free_pages_num = 0;
atomic_set(&blkif->persistent_gnt_in_use, 0);
INIT_LIST_HEAD(&blkif->pending_free);
for (i = 0; i < XEN_BLKIF_REQS; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
if (!req)
goto fail;
list_add_tail(&req->free_list,
&blkif->pending_free);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
req->segments[j] = kzalloc(sizeof(*req->segments[0]),
GFP_KERNEL);
if (!req->segments[j])
goto fail;
}
for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
GFP_KERNEL);
if (!req->indirect_pages[j])
goto fail;
}
}
spin_lock_init(&blkif->pending_free_lock);
init_waitqueue_head(&blkif->pending_free_wq);
init_waitqueue_head(&blkif->shutdown_wq);
return blkif;
fail:
list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
if (!req->segments[j])
break;
kfree(req->segments[j]);
}
for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
if (!req->indirect_pages[j])
break;
kfree(req->indirect_pages[j]);
}
kfree(req);
}
kmem_cache_free(xen_blkif_cachep, blkif);
return ERR_PTR(-ENOMEM);
}
static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
......@@ -178,6 +232,7 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)
{
if (blkif->xenblkd) {
kthread_stop(blkif->xenblkd);
wake_up(&blkif->shutdown_wq);
blkif->xenblkd = NULL;
}
......@@ -198,8 +253,28 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)
static void xen_blkif_free(struct xen_blkif *blkif)
{
struct pending_req *req, *n;
int i = 0, j;
if (!atomic_dec_and_test(&blkif->refcnt))
BUG();
/* Check that there is no request in use */
list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
kfree(req->segments[j]);
for (j = 0; j < MAX_INDIRECT_PAGES; j++)
kfree(req->indirect_pages[j]);
kfree(req);
i++;
}
WARN_ON(i != XEN_BLKIF_REQS);
kmem_cache_free(xen_blkif_cachep, blkif);
}
......@@ -678,6 +753,11 @@ static void connect(struct backend_info *be)
dev->nodename);
goto abort;
}
err = xenbus_printf(xbt, dev->nodename, "feature-max-indirect-segments", "%u",
MAX_INDIRECT_SEGMENTS);
if (err)
dev_warn(&dev->dev, "writing %s/feature-max-indirect-segments (%d)",
dev->nodename, err);
err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
(unsigned long long)vbd_sz(&be->blkif->vbd));
......@@ -704,6 +784,11 @@ static void connect(struct backend_info *be)
dev->nodename);
goto abort;
}
err = xenbus_printf(xbt, dev->nodename, "physical-sector-size", "%u",
bdev_physical_block_size(be->blkif->vbd.bdev));
if (err)
xenbus_dev_error(dev, err, "writing %s/physical-sector-size",
dev->nodename);
err = xenbus_transaction_end(xbt, 0);
if (err == -EAGAIN)
......
此差异已折叠。
......@@ -63,7 +63,10 @@
#include "bcache.h"
#include "btree.h"
#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/random.h>
#include <trace/events/bcache.h>
#define MAX_IN_FLIGHT_DISCARDS 8U
......@@ -151,7 +154,7 @@ static void discard_finish(struct work_struct *w)
mutex_unlock(&ca->set->bucket_lock);
closure_wake_up(&ca->set->bucket_wait);
wake_up(&ca->set->alloc_wait);
wake_up_process(ca->alloc_thread);
closure_put(&ca->set->cl);
}
......@@ -350,38 +353,30 @@ static void invalidate_buckets(struct cache *ca)
break;
}
pr_debug("free %zu/%zu free_inc %zu/%zu unused %zu/%zu",
fifo_used(&ca->free), ca->free.size,
fifo_used(&ca->free_inc), ca->free_inc.size,
fifo_used(&ca->unused), ca->unused.size);
trace_bcache_alloc_invalidate(ca);
}
#define allocator_wait(ca, cond) \
do { \
DEFINE_WAIT(__wait); \
\
while (1) { \
prepare_to_wait(&ca->set->alloc_wait, \
&__wait, TASK_INTERRUPTIBLE); \
set_current_state(TASK_INTERRUPTIBLE); \
if (cond) \
break; \
\
mutex_unlock(&(ca)->set->bucket_lock); \
if (test_bit(CACHE_SET_STOPPING_2, &ca->set->flags)) { \
finish_wait(&ca->set->alloc_wait, &__wait); \
closure_return(cl); \
} \
if (kthread_should_stop()) \
return 0; \
\
try_to_freeze(); \
schedule(); \
mutex_lock(&(ca)->set->bucket_lock); \
} \
\
finish_wait(&ca->set->alloc_wait, &__wait); \
__set_current_state(TASK_RUNNING); \
} while (0)
void bch_allocator_thread(struct closure *cl)
static int bch_allocator_thread(void *arg)
{
struct cache *ca = container_of(cl, struct cache, alloc);
struct cache *ca = arg;
mutex_lock(&ca->set->bucket_lock);
......@@ -442,7 +437,7 @@ long bch_bucket_alloc(struct cache *ca, unsigned watermark, struct closure *cl)
{
long r = -1;
again:
wake_up(&ca->set->alloc_wait);
wake_up_process(ca->alloc_thread);
if (fifo_used(&ca->free) > ca->watermark[watermark] &&
fifo_pop(&ca->free, r)) {
......@@ -476,9 +471,7 @@ long bch_bucket_alloc(struct cache *ca, unsigned watermark, struct closure *cl)
return r;
}
pr_debug("alloc failure: blocked %i free %zu free_inc %zu unused %zu",
atomic_read(&ca->set->prio_blocked), fifo_used(&ca->free),
fifo_used(&ca->free_inc), fifo_used(&ca->unused));
trace_bcache_alloc_fail(ca);
if (cl) {
closure_wait(&ca->set->bucket_wait, cl);
......@@ -552,6 +545,17 @@ int bch_bucket_alloc_set(struct cache_set *c, unsigned watermark,
/* Init */
int bch_cache_allocator_start(struct cache *ca)
{
struct task_struct *k = kthread_run(bch_allocator_thread,
ca, "bcache_allocator");
if (IS_ERR(k))
return PTR_ERR(k);
ca->alloc_thread = k;
return 0;
}
void bch_cache_allocator_exit(struct cache *ca)
{
struct discard *d;
......
......@@ -178,7 +178,6 @@
#define pr_fmt(fmt) "bcache: %s() " fmt "\n", __func__
#include <linux/bio.h>
#include <linux/blktrace_api.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mutex.h>
......@@ -388,8 +387,6 @@ struct keybuf_key {
typedef bool (keybuf_pred_fn)(struct keybuf *, struct bkey *);
struct keybuf {
keybuf_pred_fn *key_predicate;
struct bkey last_scanned;
spinlock_t lock;
......@@ -437,9 +434,12 @@ struct bcache_device {
/* If nonzero, we're detaching/unregistering from cache set */
atomic_t detaching;
int flush_done;
uint64_t nr_stripes;
unsigned stripe_size_bits;
atomic_t *stripe_sectors_dirty;
atomic_long_t sectors_dirty;
unsigned long sectors_dirty_gc;
unsigned long sectors_dirty_last;
long sectors_dirty_derivative;
......@@ -531,6 +531,7 @@ struct cached_dev {
unsigned sequential_merge:1;
unsigned verify:1;
unsigned partial_stripes_expensive:1;
unsigned writeback_metadata:1;
unsigned writeback_running:1;
unsigned char writeback_percent;
......@@ -565,8 +566,7 @@ struct cache {
unsigned watermark[WATERMARK_MAX];
struct closure alloc;
struct workqueue_struct *alloc_workqueue;
struct task_struct *alloc_thread;
struct closure prio;
struct prio_set *disk_buckets;
......@@ -664,13 +664,9 @@ struct gc_stat {
* CACHE_SET_STOPPING always gets set first when we're closing down a cache set;
* we'll continue to run normally for awhile with CACHE_SET_STOPPING set (i.e.
* flushing dirty data).
*
* CACHE_SET_STOPPING_2 gets set at the last phase, when it's time to shut down
* the allocation thread.
*/
#define CACHE_SET_UNREGISTERING 0
#define CACHE_SET_STOPPING 1
#define CACHE_SET_STOPPING_2 2
struct cache_set {
struct closure cl;
......@@ -703,9 +699,6 @@ struct cache_set {
/* For the btree cache */
struct shrinker shrink;
/* For the allocator itself */
wait_queue_head_t alloc_wait;
/* For the btree cache and anything allocation related */
struct mutex bucket_lock;
......@@ -823,10 +816,9 @@ struct cache_set {
/*
* A btree node on disk could have too many bsets for an iterator to fit
* on the stack - this is a single element mempool for btree_read_work()
* on the stack - have to dynamically allocate them
*/
struct mutex fill_lock;
struct btree_iter *fill_iter;
mempool_t *fill_iter;
/*
* btree_sort() is a merge sort and requires temporary space - single
......@@ -834,6 +826,7 @@ struct cache_set {
*/
struct mutex sort_lock;
struct bset *sort;
unsigned sort_crit_factor;
/* List of buckets we're currently writing data to */
struct list_head data_buckets;
......@@ -906,8 +899,6 @@ static inline unsigned local_clock_us(void)
return local_clock() >> 10;
}
#define MAX_BSETS 4U
#define BTREE_PRIO USHRT_MAX
#define INITIAL_PRIO 32768
......@@ -1112,23 +1103,6 @@ static inline void __bkey_put(struct cache_set *c, struct bkey *k)
atomic_dec_bug(&PTR_BUCKET(c, k, i)->pin);
}
/* Blktrace macros */
#define blktrace_msg(c, fmt, ...) \
do { \
struct request_queue *q = bdev_get_queue(c->bdev); \
if (q) \
blk_add_trace_msg(q, fmt, ##__VA_ARGS__); \
} while (0)
#define blktrace_msg_all(s, fmt, ...) \
do { \
struct cache *_c; \
unsigned i; \
for_each_cache(_c, (s), i) \
blktrace_msg(_c, fmt, ##__VA_ARGS__); \
} while (0)
static inline void cached_dev_put(struct cached_dev *dc)
{
if (atomic_dec_and_test(&dc->count))
......@@ -1173,10 +1147,16 @@ static inline uint8_t bucket_disk_gen(struct bucket *b)
static struct kobj_attribute ksysfs_##n = \
__ATTR(n, S_IWUSR|S_IRUSR, show, store)
/* Forward declarations */
static inline void wake_up_allocators(struct cache_set *c)
{
struct cache *ca;
unsigned i;
for_each_cache(ca, c, i)
wake_up_process(ca->alloc_thread);
}
void bch_writeback_queue(struct cached_dev *);
void bch_writeback_add(struct cached_dev *, unsigned);
/* Forward declarations */
void bch_count_io_errors(struct cache *, int, const char *);
void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
......@@ -1193,7 +1173,6 @@ void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
uint8_t bch_inc_gen(struct cache *, struct bucket *);
void bch_rescale_priorities(struct cache_set *, int);
bool bch_bucket_add_unused(struct cache *, struct bucket *);
void bch_allocator_thread(struct closure *);
long bch_bucket_alloc(struct cache *, unsigned, struct closure *);
void bch_bucket_free(struct cache_set *, struct bkey *);
......@@ -1241,9 +1220,9 @@ void bch_cache_set_stop(struct cache_set *);
struct cache_set *bch_cache_set_alloc(struct cache_sb *);
void bch_btree_cache_free(struct cache_set *);
int bch_btree_cache_alloc(struct cache_set *);
void bch_cached_dev_writeback_init(struct cached_dev *);
void bch_moving_init_cache_set(struct cache_set *);
int bch_cache_allocator_start(struct cache *ca);
void bch_cache_allocator_exit(struct cache *ca);
int bch_cache_allocator_init(struct cache *ca);
......
......@@ -78,6 +78,7 @@ struct bkey *bch_keylist_pop(struct keylist *l)
bool __bch_ptr_invalid(struct cache_set *c, int level, const struct bkey *k)
{
unsigned i;
char buf[80];
if (level && (!KEY_PTRS(k) || !KEY_SIZE(k) || KEY_DIRTY(k)))
goto bad;
......@@ -102,7 +103,8 @@ bool __bch_ptr_invalid(struct cache_set *c, int level, const struct bkey *k)
return false;
bad:
cache_bug(c, "spotted bad key %s: %s", pkey(k), bch_ptr_status(c, k));
bch_bkey_to_text(buf, sizeof(buf), k);
cache_bug(c, "spotted bad key %s: %s", buf, bch_ptr_status(c, k));
return true;
}
......@@ -162,10 +164,16 @@ bool bch_ptr_bad(struct btree *b, const struct bkey *k)
#ifdef CONFIG_BCACHE_EDEBUG
bug:
mutex_unlock(&b->c->bucket_lock);
btree_bug(b,
{
char buf[80];
bch_bkey_to_text(buf, sizeof(buf), k);
btree_bug(b,
"inconsistent pointer %s: bucket %zu pin %i prio %i gen %i last_gc %i mark %llu gc_gen %i",
pkey(k), PTR_BUCKET_NR(b->c, k, i), atomic_read(&g->pin),
g->prio, g->gen, g->last_gc, GC_MARK(g), g->gc_gen);
buf, PTR_BUCKET_NR(b->c, k, i), atomic_read(&g->pin),
g->prio, g->gen, g->last_gc, GC_MARK(g), g->gc_gen);
}
return true;
#endif
}
......@@ -1084,33 +1092,39 @@ void bch_btree_sort_into(struct btree *b, struct btree *new)
new->sets->size = 0;
}
#define SORT_CRIT (4096 / sizeof(uint64_t))
void bch_btree_sort_lazy(struct btree *b)
{
if (b->nsets) {
unsigned i, j, keys = 0, total;
unsigned crit = SORT_CRIT;
int i;
for (i = 0; i <= b->nsets; i++)
keys += b->sets[i].data->keys;
total = keys;
/* Don't sort if nothing to do */
if (!b->nsets)
goto out;
for (j = 0; j < b->nsets; j++) {
if (keys * 2 < total ||
keys < 1000) {
bch_btree_sort_partial(b, j);
return;
}
/* If not a leaf node, always sort */
if (b->level) {
bch_btree_sort(b);
return;
}
keys -= b->sets[j].data->keys;
}
for (i = b->nsets - 1; i >= 0; --i) {
crit *= b->c->sort_crit_factor;
/* Must sort if b->nsets == 3 or we'll overflow */
if (b->nsets >= (MAX_BSETS - 1) - b->level) {
bch_btree_sort(b);
if (b->sets[i].data->keys < crit) {
bch_btree_sort_partial(b, i);
return;
}
}
/* Sort if we'd overflow */
if (b->nsets + 1 == MAX_BSETS) {
bch_btree_sort(b);
return;
}
out:
bset_build_written_tree(b);
}
......
#ifndef _BCACHE_BSET_H
#define _BCACHE_BSET_H
#include <linux/slab.h>
/*
* BKEYS:
*
......@@ -142,6 +144,8 @@
/* Btree key comparison/iteration */
#define MAX_BSETS 4U
struct btree_iter {
size_t size, used;
struct btree_iter_set {
......
此差异已折叠。
......@@ -102,7 +102,6 @@
#include "debug.h"
struct btree_write {
struct closure *owner;
atomic_t *journal;
/* If btree_split() frees a btree node, it writes a new pointer to that
......@@ -142,16 +141,12 @@ struct btree {
*/
struct bset_tree sets[MAX_BSETS];
/* Used to refcount bio splits, also protects b->bio */
/* For outstanding btree writes, used as a lock - protects write_idx */
struct closure_with_waitlist io;
/* Gets transferred to w->prio_blocked - see the comment there */
int prio_blocked;
struct list_head list;
struct delayed_work work;
uint64_t io_start_time;
struct btree_write writes[2];
struct bio *bio;
};
......@@ -164,13 +159,11 @@ static inline void set_btree_node_ ## flag(struct btree *b) \
{ set_bit(BTREE_NODE_ ## flag, &b->flags); } \
enum btree_flags {
BTREE_NODE_read_done,
BTREE_NODE_io_error,
BTREE_NODE_dirty,
BTREE_NODE_write_idx,
};
BTREE_FLAG(read_done);
BTREE_FLAG(io_error);
BTREE_FLAG(dirty);
BTREE_FLAG(write_idx);
......@@ -278,6 +271,13 @@ struct btree_op {
BKEY_PADDED(replace);
};
enum {
BTREE_INSERT_STATUS_INSERT,
BTREE_INSERT_STATUS_BACK_MERGE,
BTREE_INSERT_STATUS_OVERWROTE,
BTREE_INSERT_STATUS_FRONT_MERGE,
};
void bch_btree_op_init_stack(struct btree_op *);
static inline void rw_lock(bool w, struct btree *b, int level)
......@@ -293,9 +293,7 @@ static inline void rw_unlock(bool w, struct btree *b)
#ifdef CONFIG_BCACHE_EDEBUG
unsigned i;
if (w &&
b->key.ptr[0] &&
btree_node_read_done(b))
if (w && b->key.ptr[0])
for (i = 0; i <= b->nsets; i++)
bch_check_key_order(b, b->sets[i].data);
#endif
......@@ -370,9 +368,8 @@ static inline bool should_split(struct btree *b)
> btree_blocks(b));
}
void bch_btree_read_done(struct closure *);
void bch_btree_read(struct btree *);
void bch_btree_write(struct btree *b, bool now, struct btree_op *op);
void bch_btree_node_read(struct btree *);
void bch_btree_node_write(struct btree *, struct closure *);
void bch_cannibalize_unlock(struct cache_set *, struct closure *);
void bch_btree_set_root(struct btree *);
......@@ -380,7 +377,6 @@ struct btree *bch_btree_node_alloc(struct cache_set *, int, struct closure *);
struct btree *bch_btree_node_get(struct cache_set *, struct bkey *,
int, struct btree_op *);
bool bch_btree_insert_keys(struct btree *, struct btree_op *);
bool bch_btree_insert_check_key(struct btree *, struct btree_op *,
struct bio *);
int bch_btree_insert(struct btree_op *, struct cache_set *);
......@@ -393,13 +389,14 @@ void bch_moving_gc(struct closure *);
int bch_btree_check(struct cache_set *, struct btree_op *);
uint8_t __bch_btree_mark_key(struct cache_set *, int, struct bkey *);
void bch_keybuf_init(struct keybuf *, keybuf_pred_fn *);
void bch_refill_keybuf(struct cache_set *, struct keybuf *, struct bkey *);
void bch_keybuf_init(struct keybuf *);
void bch_refill_keybuf(struct cache_set *, struct keybuf *, struct bkey *,
keybuf_pred_fn *);
bool bch_keybuf_check_overlapping(struct keybuf *, struct bkey *,
struct bkey *);
void bch_keybuf_del(struct keybuf *, struct keybuf_key *);
struct keybuf_key *bch_keybuf_next(struct keybuf *);
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *,
struct keybuf *, struct bkey *);
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *, struct keybuf *,
struct bkey *, keybuf_pred_fn *);
#endif
......@@ -66,16 +66,18 @@ static inline void closure_put_after_sub(struct closure *cl, int flags)
} else {
struct closure *parent = cl->parent;
struct closure_waitlist *wait = closure_waitlist(cl);
closure_fn *destructor = cl->fn;
closure_debug_destroy(cl);
smp_mb();
atomic_set(&cl->remaining, -1);
if (wait)
closure_wake_up(wait);
if (cl->fn)
cl->fn(cl);
if (destructor)
destructor(cl);
if (parent)
closure_put(parent);
......
......@@ -47,11 +47,10 @@ const char *bch_ptr_status(struct cache_set *c, const struct bkey *k)
return "";
}
struct keyprint_hack bch_pkey(const struct bkey *k)
int bch_bkey_to_text(char *buf, size_t size, const struct bkey *k)
{
unsigned i = 0;
struct keyprint_hack r;
char *out = r.s, *end = r.s + KEYHACK_SIZE;
char *out = buf, *end = buf + size;
#define p(...) (out += scnprintf(out, end - out, __VA_ARGS__))
......@@ -75,16 +74,14 @@ struct keyprint_hack bch_pkey(const struct bkey *k)
if (KEY_CSUM(k))
p(" cs%llu %llx", KEY_CSUM(k), k->ptr[1]);
#undef p
return r;
return out - buf;
}
struct keyprint_hack bch_pbtree(const struct btree *b)
int bch_btree_to_text(char *buf, size_t size, const struct btree *b)
{
struct keyprint_hack r;
snprintf(r.s, 40, "%zu level %i/%i", PTR_BUCKET_NR(b->c, &b->key, 0),
b->level, b->c->root ? b->c->root->level : -1);
return r;
return scnprintf(buf, size, "%zu level %i/%i",
PTR_BUCKET_NR(b->c, &b->key, 0),
b->level, b->c->root ? b->c->root->level : -1);
}
#if defined(CONFIG_BCACHE_DEBUG) || defined(CONFIG_BCACHE_EDEBUG)
......@@ -100,10 +97,12 @@ static void dump_bset(struct btree *b, struct bset *i)
{
struct bkey *k;
unsigned j;
char buf[80];
for (k = i->start; k < end(i); k = bkey_next(k)) {
bch_bkey_to_text(buf, sizeof(buf), k);
printk(KERN_ERR "block %zu key %zi/%u: %s", index(i, b),
(uint64_t *) k - i->d, i->keys, pkey(k));
(uint64_t *) k - i->d, i->keys, buf);
for (j = 0; j < KEY_PTRS(k); j++) {
size_t n = PTR_BUCKET_NR(b->c, k, j);
......@@ -144,7 +143,7 @@ void bch_btree_verify(struct btree *b, struct bset *new)
v->written = 0;
v->level = b->level;
bch_btree_read(v);
bch_btree_node_read(v);
closure_wait_event(&v->io.wait, &cl,
atomic_read(&b->io.cl.remaining) == -1);
......@@ -200,7 +199,7 @@ void bch_data_verify(struct search *s)
if (!check)
return;
if (bch_bio_alloc_pages(check, GFP_NOIO))
if (bio_alloc_pages(check, GFP_NOIO))
goto out_put;
check->bi_rw = READ_SYNC;
......@@ -252,6 +251,7 @@ static void vdump_bucket_and_panic(struct btree *b, const char *fmt,
va_list args)
{
unsigned i;
char buf[80];
console_lock();
......@@ -262,7 +262,8 @@ static void vdump_bucket_and_panic(struct btree *b, const char *fmt,
console_unlock();
panic("at %s\n", pbtree(b));
bch_btree_to_text(buf, sizeof(buf), b);
panic("at %s\n", buf);
}
void bch_check_key_order_msg(struct btree *b, struct bset *i,
......@@ -337,6 +338,7 @@ static ssize_t bch_dump_read(struct file *file, char __user *buf,
{
struct dump_iterator *i = file->private_data;
ssize_t ret = 0;
char kbuf[80];
while (size) {
struct keybuf_key *w;
......@@ -355,11 +357,12 @@ static ssize_t bch_dump_read(struct file *file, char __user *buf,
if (i->bytes)
break;
w = bch_keybuf_next_rescan(i->c, &i->keys, &MAX_KEY);
w = bch_keybuf_next_rescan(i->c, &i->keys, &MAX_KEY, dump_pred);
if (!w)
break;
i->bytes = snprintf(i->buf, PAGE_SIZE, "%s\n", pkey(&w->key));
bch_bkey_to_text(kbuf, sizeof(kbuf), &w->key);
i->bytes = snprintf(i->buf, PAGE_SIZE, "%s\n", kbuf);
bch_keybuf_del(&i->keys, w);
}
......@@ -377,7 +380,7 @@ static int bch_dump_open(struct inode *inode, struct file *file)
file->private_data = i;
i->c = c;
bch_keybuf_init(&i->keys, dump_pred);
bch_keybuf_init(&i->keys);
i->keys.last_scanned = KEY(0, 0, 0);
return 0;
......@@ -409,142 +412,6 @@ void bch_debug_init_cache_set(struct cache_set *c)
#endif
/* Fuzz tester has rotted: */
#if 0
static ssize_t btree_fuzz(struct kobject *k, struct kobj_attribute *a,
const char *buffer, size_t size)
{
void dump(struct btree *b)
{
struct bset *i;
for (i = b->sets[0].data;
index(i, b) < btree_blocks(b) &&
i->seq == b->sets[0].data->seq;
i = ((void *) i) + set_blocks(i, b->c) * block_bytes(b->c))
dump_bset(b, i);
}
struct cache_sb *sb;
struct cache_set *c;
struct btree *all[3], *b, *fill, *orig;
int j;
struct btree_op op;
bch_btree_op_init_stack(&op);
sb = kzalloc(sizeof(struct cache_sb), GFP_KERNEL);
if (!sb)
return -ENOMEM;
sb->bucket_size = 128;
sb->block_size = 4;
c = bch_cache_set_alloc(sb);
if (!c)
return -ENOMEM;
for (j = 0; j < 3; j++) {
BUG_ON(list_empty(&c->btree_cache));
all[j] = list_first_entry(&c->btree_cache, struct btree, list);
list_del_init(&all[j]->list);
all[j]->key = KEY(0, 0, c->sb.bucket_size);
bkey_copy_key(&all[j]->key, &MAX_KEY);
}
b = all[0];
fill = all[1];
orig = all[2];
while (1) {
for (j = 0; j < 3; j++)
all[j]->written = all[j]->nsets = 0;
bch_bset_init_next(b);
while (1) {
struct bset *i = write_block(b);
struct bkey *k = op.keys.top;
unsigned rand;
bkey_init(k);
rand = get_random_int();
op.type = rand & 1
? BTREE_INSERT
: BTREE_REPLACE;
rand >>= 1;
SET_KEY_SIZE(k, bucket_remainder(c, rand));
rand >>= c->bucket_bits;
rand &= 1024 * 512 - 1;
rand += c->sb.bucket_size;
SET_KEY_OFFSET(k, rand);
#if 0
SET_KEY_PTRS(k, 1);
#endif
bch_keylist_push(&op.keys);
bch_btree_insert_keys(b, &op);
if (should_split(b) ||
set_blocks(i, b->c) !=
__set_blocks(i, i->keys + 15, b->c)) {
i->csum = csum_set(i);
memcpy(write_block(fill),
i, set_bytes(i));
b->written += set_blocks(i, b->c);
fill->written = b->written;
if (b->written == btree_blocks(b))
break;
bch_btree_sort_lazy(b);
bch_bset_init_next(b);
}
}
memcpy(orig->sets[0].data,
fill->sets[0].data,
btree_bytes(c));
bch_btree_sort(b);
fill->written = 0;
bch_btree_read_done(&fill->io.cl);
if (b->sets[0].data->keys != fill->sets[0].data->keys ||
memcmp(b->sets[0].data->start,
fill->sets[0].data->start,
b->sets[0].data->keys * sizeof(uint64_t))) {
struct bset *i = b->sets[0].data;
struct bkey *k, *l;
for (k = i->start,
l = fill->sets[0].data->start;
k < end(i);
k = bkey_next(k), l = bkey_next(l))
if (bkey_cmp(k, l) ||
KEY_SIZE(k) != KEY_SIZE(l))
pr_err("key %zi differs: %s != %s",
(uint64_t *) k - i->d,
pkey(k), pkey(l));
for (j = 0; j < 3; j++) {
pr_err("**** Set %i ****", j);
dump(all[j]);
}
panic("\n");
}
pr_info("fuzz complete: %i keys", b->sets[0].data->keys);
}
}
kobj_attribute_write(fuzz, btree_fuzz);
#endif
void bch_debug_exit(void)
{
if (!IS_ERR_OR_NULL(debug))
......@@ -554,11 +421,6 @@ void bch_debug_exit(void)
int __init bch_debug_init(struct kobject *kobj)
{
int ret = 0;
#if 0
ret = sysfs_create_file(kobj, &ksysfs_fuzz.attr);
if (ret)
return ret;
#endif
debug = debugfs_create_dir("bcache", NULL);
return ret;
......
......@@ -3,15 +3,8 @@
/* Btree/bkey debug printing */
#define KEYHACK_SIZE 80
struct keyprint_hack {
char s[KEYHACK_SIZE];
};
struct keyprint_hack bch_pkey(const struct bkey *k);
struct keyprint_hack bch_pbtree(const struct btree *b);
#define pkey(k) (&bch_pkey(k).s[0])
#define pbtree(b) (&bch_pbtree(b).s[0])
int bch_bkey_to_text(char *buf, size_t size, const struct bkey *k);
int bch_btree_to_text(char *buf, size_t size, const struct btree *b);
#ifdef CONFIG_BCACHE_EDEBUG
......
......@@ -9,6 +9,8 @@
#include "bset.h"
#include "debug.h"
#include <linux/blkdev.h>
static void bch_bi_idx_hack_endio(struct bio *bio, int error)
{
struct bio *p = bio->bi_private;
......@@ -66,13 +68,6 @@ static void bch_generic_make_request_hack(struct bio *bio)
* The newly allocated bio will point to @bio's bi_io_vec, if the split was on a
* bvec boundry; it is the caller's responsibility to ensure that @bio is not
* freed before the split.
*
* If bch_bio_split() is running under generic_make_request(), it's not safe to
* allocate more than one bio from the same bio set. Therefore, if it is running
* under generic_make_request() it masks out __GFP_WAIT when doing the
* allocation. The caller must check for failure if there's any possibility of
* it being called from under generic_make_request(); it is then the caller's
* responsibility to retry from a safe context (by e.g. punting to workqueue).
*/
struct bio *bch_bio_split(struct bio *bio, int sectors,
gfp_t gfp, struct bio_set *bs)
......@@ -83,20 +78,13 @@ struct bio *bch_bio_split(struct bio *bio, int sectors,
BUG_ON(sectors <= 0);
/*
* If we're being called from underneath generic_make_request() and we
* already allocated any bios from this bio set, we risk deadlock if we
* use the mempool. So instead, we possibly fail and let the caller punt
* to workqueue or somesuch and retry in a safe context.
*/
if (current->bio_list)
gfp &= ~__GFP_WAIT;
if (sectors >= bio_sectors(bio))
return bio;
if (bio->bi_rw & REQ_DISCARD) {
ret = bio_alloc_bioset(gfp, 1, bs);
if (!ret)
return NULL;
idx = 0;
goto out;
}
......@@ -160,17 +148,18 @@ static unsigned bch_bio_max_sectors(struct bio *bio)
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
unsigned max_segments = min_t(unsigned, BIO_MAX_PAGES,
queue_max_segments(q));
struct bio_vec *bv, *end = bio_iovec(bio) +
min_t(int, bio_segments(bio), max_segments);
if (bio->bi_rw & REQ_DISCARD)
return min(ret, q->limits.max_discard_sectors);
if (bio_segments(bio) > max_segments ||
q->merge_bvec_fn) {
struct bio_vec *bv;
int i, seg = 0;
ret = 0;
for (bv = bio_iovec(bio); bv < end; bv++) {
bio_for_each_segment(bv, bio, i) {
struct bvec_merge_data bvm = {
.bi_bdev = bio->bi_bdev,
.bi_sector = bio->bi_sector,
......@@ -178,10 +167,14 @@ static unsigned bch_bio_max_sectors(struct bio *bio)
.bi_rw = bio->bi_rw,
};
if (seg == max_segments)
break;
if (q->merge_bvec_fn &&
q->merge_bvec_fn(q, &bvm, bv) < (int) bv->bv_len)
break;
seg++;
ret += bv->bv_len >> 9;
}
}
......@@ -218,30 +211,10 @@ static void bch_bio_submit_split_endio(struct bio *bio, int error)
closure_put(cl);
}
static void __bch_bio_submit_split(struct closure *cl)
{
struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
struct bio *bio = s->bio, *n;
do {
n = bch_bio_split(bio, bch_bio_max_sectors(bio),
GFP_NOIO, s->p->bio_split);
if (!n)
continue_at(cl, __bch_bio_submit_split, system_wq);
n->bi_end_io = bch_bio_submit_split_endio;
n->bi_private = cl;
closure_get(cl);
bch_generic_make_request_hack(n);
} while (n != bio);
continue_at(cl, bch_bio_submit_split_done, NULL);
}
void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
{
struct bio_split_hook *s;
struct bio *n;
if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
goto submit;
......@@ -250,6 +223,7 @@ void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
goto submit;
s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
closure_init(&s->cl, NULL);
s->bio = bio;
s->p = p;
......@@ -257,8 +231,18 @@ void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
s->bi_private = bio->bi_private;
bio_get(bio);
closure_call(&s->cl, __bch_bio_submit_split, NULL, NULL);
return;
do {
n = bch_bio_split(bio, bch_bio_max_sectors(bio),
GFP_NOIO, s->p->bio_split);
n->bi_end_io = bch_bio_submit_split_endio;
n->bi_private = &s->cl;
closure_get(&s->cl);
bch_generic_make_request_hack(n);
} while (n != bio);
continue_at(&s->cl, bch_bio_submit_split_done, NULL);
submit:
bch_generic_make_request_hack(bio);
}
......
......@@ -9,6 +9,8 @@
#include "debug.h"
#include "request.h"
#include <trace/events/bcache.h>
/*
* Journal replay/recovery:
*
......@@ -182,9 +184,14 @@ int bch_journal_read(struct cache_set *c, struct list_head *list,
pr_debug("starting binary search, l %u r %u", l, r);
while (l + 1 < r) {
seq = list_entry(list->prev, struct journal_replay,
list)->j.seq;
m = (l + r) >> 1;
read_bucket(m);
if (read_bucket(m))
if (seq != list_entry(list->prev, struct journal_replay,
list)->j.seq)
l = m;
else
r = m;
......@@ -300,7 +307,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list,
for (k = i->j.start;
k < end(&i->j);
k = bkey_next(k)) {
pr_debug("%s", pkey(k));
trace_bcache_journal_replay_key(k);
bkey_copy(op->keys.top, k);
bch_keylist_push(&op->keys);
......@@ -384,7 +392,7 @@ static void btree_flush_write(struct cache_set *c)
return;
found:
if (btree_node_dirty(best))
bch_btree_write(best, true, NULL);
bch_btree_node_write(best, NULL);
rw_unlock(true, best);
}
......@@ -617,7 +625,7 @@ static void journal_write_unlocked(struct closure *cl)
bio_reset(bio);
bio->bi_sector = PTR_OFFSET(k, i);
bio->bi_bdev = ca->bdev;
bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH;
bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA;
bio->bi_size = sectors << 9;
bio->bi_end_io = journal_write_endio;
......@@ -712,7 +720,8 @@ void bch_journal(struct closure *cl)
spin_lock(&c->journal.lock);
if (journal_full(&c->journal)) {
/* XXX: tracepoint */
trace_bcache_journal_full(c);
closure_wait(&c->journal.wait, cl);
journal_reclaim(c);
......@@ -728,13 +737,15 @@ void bch_journal(struct closure *cl)
if (b * c->sb.block_size > PAGE_SECTORS << JSET_BITS ||
b > c->journal.blocks_free) {
/* XXX: If we were inserting so many keys that they won't fit in
trace_bcache_journal_entry_full(c);
/*
* XXX: If we were inserting so many keys that they won't fit in
* an _empty_ journal write, we'll deadlock. For now, handle
* this in bch_keylist_realloc() - but something to think about.
*/
BUG_ON(!w->data->keys);
/* XXX: tracepoint */
BUG_ON(!closure_wait(&w->wait, cl));
closure_flush(&c->journal.io);
......
此差异已折叠。
此差异已折叠。
......@@ -30,7 +30,7 @@ struct search {
};
void bch_cache_read_endio(struct bio *, int);
int bch_get_congested(struct cache_set *);
unsigned bch_get_congested(struct cache_set *);
void bch_insert_data(struct closure *cl);
void bch_btree_insert_async(struct closure *);
void bch_cache_read_endio(struct bio *, int);
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
......@@ -228,23 +228,6 @@ start: bv->bv_len = min_t(size_t, PAGE_SIZE - bv->bv_offset,
}
}
int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp)
{
int i;
struct bio_vec *bv;
bio_for_each_segment(bv, bio, i) {
bv->bv_page = alloc_page(gfp);
if (!bv->bv_page) {
while (bv-- != bio->bi_io_vec + bio->bi_idx)
__free_page(bv->bv_page);
return -ENOMEM;
}
}
return 0;
}
/*
* Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group (Any
* use permitted, subject to terms of PostgreSQL license; see.)
......
......@@ -15,8 +15,6 @@
struct closure;
#include <trace/events/bcache.h>
#ifdef CONFIG_BCACHE_EDEBUG
#define atomic_dec_bug(v) BUG_ON(atomic_dec_return(v) < 0)
......@@ -566,12 +564,8 @@ static inline unsigned fract_exp_two(unsigned x, unsigned fract_bits)
return x;
}
#define bio_end(bio) ((bio)->bi_sector + bio_sectors(bio))
void bch_bio_map(struct bio *bio, void *base);
int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp);
static inline sector_t bdev_sectors(struct block_device *bdev)
{
return bdev->bd_inode->i_size >> 9;
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
......@@ -181,6 +181,8 @@ GENL_struct(DRBD_NLA_RESIZE_PARMS, 7, resize_parms,
__u64_field(1, DRBD_GENLA_F_MANDATORY, resize_size)
__flg_field(2, DRBD_GENLA_F_MANDATORY, resize_force)
__flg_field(3, DRBD_GENLA_F_MANDATORY, no_resync)
__u32_field_def(4, 0 /* OPTIONAL */, al_stripes, DRBD_AL_STRIPES_DEF)
__u32_field_def(5, 0 /* OPTIONAL */, al_stripe_size, DRBD_AL_STRIPE_SIZE_DEF)
)
GENL_struct(DRBD_NLA_STATE_INFO, 8, state_info,
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册