- 30 8月, 2017 2 次提交
-
-
由 Lars Ellenberg 提交于
In protocol != C, we forgot to send the P_NEG_ACK for failing writes. Once we no longer submit to local disk, because we already "detached", due to the typical "on-io-error detach;" config setting, we already send the neg acks right away. Only those requests that have been submitted, and have been error-completed by the local disk, would forget to send the neg-ack, and only in asynchronous replication (protocol != C). Unless this happened during resync, where we already always send acks, regardless of protocol. The primary side needs the P_NEG_ACK in order to mark the affected block(s) for resync in its out-of-sync bitmap. If the blocks in question are not re-written again, we may miss to resync them later, causing data inconsistencies. This patch will always send the neg-acks, and also at least try to persist the out-of-sync status on the local node already. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Lars Ellenberg 提交于
Recently, drbd_recv_header() was changed to potentially implicitly "unplug" the backend device(s), in case there is currently nothing to receive. Be more explicit about it: re-introduce the original drbd_recv_header(), and introduce a new drbd_recv_header_maybe_unplug() for use by the receiver "main loop". Using explicit plugging via blk_start_plug(); blk_finish_plug(); really helps the io-scheduler of the backend with merging requests. Wrap the receiver "main loop" with such a plug. Also catch unplug events on the Primary, and try to propagate. This is performance relevant. Without this, if the receiving side does not merge requests, number of IOPS on the peer can me significantly higher than IOPS on the Primary, and can easily become the bottleneck. Together, both changes should help to reduce the number of IOPS as seen on the backend of the receiving side, by increasing the chance of merging mergable requests, without trading latency for more throughput. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 24 8月, 2017 1 次提交
-
-
由 Christoph Hellwig 提交于
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 09 6月, 2017 1 次提交
-
-
由 Christoph Hellwig 提交于
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 09 4月, 2017 1 次提交
-
-
由 Christoph Hellwig 提交于
It seems like DRBD assumes its on the wire TRIM request always zeroes data. Use that fact to implement REQ_OP_WRITE_ZEROES. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 02 3月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 8月, 2016 1 次提交
-
-
由 Jens Axboe 提交于
Since commit 63a4cc24, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 21 7月, 2016 1 次提交
-
-
由 Christoph Hellwig 提交于
These two are confusing leftover of the old world order, combining values of the REQ_OP_ and REQ_ namespaces. For callers that don't special case we mostly just replace bi_rw with bio_data_dir or op_is_write, except for the few cases where a switch over the REQ_OP_ values makes more sense. Any check for READA is replaced with an explicit check for REQ_RAHEAD. Also remove the READA alias for REQ_RAHEAD. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NMike Christie <mchristi@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 14 6月, 2016 5 次提交
-
-
由 Fabian Frederick 提交于
This contains various cosmetic fixes ranging from simple typos to const-ifying, and using booleans properly. Original commit messages from Fabian's patch set: drbd: debugfs: constify drbd_version_fops drbd: use seq_put instead of seq_print where possible drbd: include linux/uaccess.h instead of asm/uaccess.h drbd: use const char * const for drbd strings drbd: kerneldoc warning fix in w_e_end_data_req() drbd: use unsigned for one bit fields drbd: use bool for peer is_ states drbd: fix typo drbd: use | for bitmask combination drbd: use true/false for bool drbd: fix drbd_bm_init() comments drbd: introduce peer state union drbd: fix maybe_pull_ahead() locking comments drbd: use bool for growing drbd: remove redundant declarations drbd: replace if/BUG by BUG_ON Signed-off-by: NFabian Frederick <fabf@skynet.be> Signed-off-by: NRoland Kammerer <roland.kammerer@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Lars Ellenberg 提交于
We will support WRITE_SAME, if * all peers support WRITE_SAME (both in kernel and DRBD version), * all peer devices support WRITE_SAME * logical_block_size is identical on all peers. We may at some point introduce a fallback on the receiving side for devices/kernels that do not support WRITE_SAME, by open-coding a submit loop. But not yet. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Lars Ellenberg 提交于
When resync is finished, we already call the "after-resync-target" handler (on the former sync target, obviously), once per volume. Paired with the before-resync-target handler, you can create snapshots, before the resync causes the volumes to become inconsistent, and discard those snapshots again, once they are no longer needed. It was also overloaded to be paired with the "fence-peer" handler, to "unfence" once the volumes are up-to-date and known good. This has some disadvantages, though: we call "fence-peer" for the whole connection (once for the group of volumes), but would call unfence as side-effect of after-resync-target once for each volume. Also, we fence on a (current, or about to become) Primary, which will later become the sync-source. Calling unfence only as a side effect of the after-resync-target handler opens a race window, between a new fence on the Primary (SyncTarget) and the unfence on the SyncTarget, which is difficult to close without some kind of "cluster wide lock" in those handlers. We would not need those handlers if we could still communicate. Which makes trying to aquire a cluster wide lock from those handlers seem like a very bad idea. This introduces the "unfence-peer" handler, which will be called per connection (once for the group of volumes), just like the fence handler, only once all volumes are back in sync, and on the SyncSource. Which is expected to be the node that previously called "fence", the node that is currently allowed to be Primary, and thus the only node that could trigger a new "fence" that could race with this unfence. Which makes us not need any cluster wide synchronization here, serializing two scripts running on the same node is trivial. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Philipp Reisner 提交于
If thinly provisioned volumes are used, during a resync the sync source tries to find out if a block is deallocated. If it is deallocated, then the resync target uses block_dev_issue_zeroout() on the range in question. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Philipp Reisner 提交于
If during resync we read only zeroes for a range of sectors assume that these secotors can be discarded on the sync target node. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 08 6月, 2016 1 次提交
-
-
由 Mike Christie 提交于
Separate the op from the rq_flag_bits and have drbd set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: NMike Christie <mchristi@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 27 1月, 2016 1 次提交
-
-
由 Herbert Xu 提交于
This patch replaces uses of the long obsolete hash interface with either shash (for non-SG users) or ahash. Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
-
- 26 11月, 2015 5 次提交
-
-
由 Lars Ellenberg 提交于
lsblk should be able to pick up stacking device driver relations involving DRBD conveniently. Even though upstream kernel since 2011 says "DON'T USE THIS UNLESS YOU'RE ALREADY USING IT." a new user has been added since (bcache), which sets the precedences for us to use it as well. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Philipp Reisner 提交于
The intention is to reduce CPU utilization. Recent measurements unveiled that the current performance bottleneck is CPU utilization on the receiving node. The asender thread became CPU limited. One of the main points is to eliminate the idr_for_each_entry() loop from the sending acks code path. One exception in that is sending back ping_acks. These stay in the ack-receiver thread. Otherwise the logic becomes too complicated for no added value. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Lars Ellenberg 提交于
Don't blame the peer for being unresponsive, if we did not even ask the question yet. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Lars Ellenberg 提交于
The only way to make DRBD intentionally call panic is to set a disk timeout, have that trigger, "abort" some request and complete to upper layers, then have the backend IO subsystem later complete these requests successfully regardless. As the attached IO pages have been recycled for other purposes meanwhile, this will cause unexpected random memory changes. To prevent corruption, we rather panic in that case. Make it obvious from stack traces that this was the case by introducing drbd_panic_after_delayed_completion_of_aborted_request(). Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Andreas Gruenbacher 提交于
Instead of using a rwlock for synchronizing state changes across resources, take the request locks of all resources for global state changes. Use resources_mutex to serialize global state changes. This means that taking the request lock of a resource is now enough to prevent changes of that resource. (Previously, a read lock on the global state lock was needed as well.) Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 29 7月, 2015 1 次提交
-
-
由 Christoph Hellwig 提交于
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 11 11月, 2014 1 次提交
-
-
由 Lars Ellenberg 提交于
If for some reason DRBD resync was the only activity on a backend device, drbd_rs_c_min_rate_throttle() would mistakenly decide that it is still initialization time, and keep throttling the resync. This patch explicitly initializes ->rs_last_events to the current backend event counters, and drops the rs_last_events == 0 from the throttle condition. Reported-by: NMikhail Sugakov <msugakov@amazon.de> Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 11 9月, 2014 5 次提交
-
-
由 Lars Ellenberg 提交于
The worker may now dequeue work items in batches. This should reduce lock contention during busy periods. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Lars Ellenberg 提交于
Shorten receive path in the asender thread. Reduces CPU utilisation of asender when receiving packets, and with that increases IOPs. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Andreas Gruenbacher 提交于
This macro doesn't add any value; just use test_bit() instead. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Andreas Gruenbacher 提交于
These macros can easily be replaced with its definition. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Andreas Gruenbacher 提交于
Now they follow the _endio naming sheme. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 11 7月, 2014 13 次提交
-
-
由 Lars Ellenberg 提交于
Add a per-connection worker thread callback_history with timing details, call site and callback function. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
To be able to present timing details in debugfs, we need to track preparation/submit times of peer requests. Track peer request flags early, before they are put on the epoch_entry lists. Waiting for activity log transactions may be a major latency factor. We want to be able to present the peer_request state accurately in debugfs, and what it is waiting for. Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO. Set it only *after* calling drbd_al_begin_io(), clear it as soon as we call drbd_al_complete_io(). Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
Background resynchronisation does some "side-stepping", or throttles itself, if it detects application IO activity, and the current resync rate estimate is above the configured "cmin-rate". What was not detected: if there is no application IO, because it blocks on activity log transactions. Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests, and count non-zero as application IO activity. This counter is exposed at proc_details level 2 and above. Also make sure to release the currently locked resync extent if we side-step due to such voluntary throttling. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
Record (in jiffies) how much time a request spends in which stages. Followup commits will use and present this additional timing information so we can better locate and tackle the root causes of latency spikes, or present the backlog for asynchronous replication. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
For diagnostic purposes, track intent, start time and latest submit time of meta data IO. Move separate members from struct drbd_device into the embeded struct drbd_md_io. s/md_io_(page|in_use)/md_io.\1/ Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
Keep the epoch entry lists (active_ee, read_ee, sync_ee, ...) consistently "oldest first". That way finding the oldest not yet successfully processed request is simply list_first_entry_or_null. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
We sometimes do if (list_empty(&w.list)) drbd_queue_work(&q, &w.list); Removal (list_del_init) may happen outside all locks, after all pending work entries have been moved to an on-stack local work list. For not dynamically allocated, but embeded, work structs, we must avoid to re-add until it really was removed. Move that list_empty check inside the spin_lock(&q->q_lock) within the helper function, and change to list_empty_careful(). This may have been the reason for a list_add corruption inside drbd_queue_work(). Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
We try to limit the number of "in-flight" resync requests. One condition for that is the amount of requested data should not exceed half of what can be covered by our "max-buffers" setting. However we compared number of 4k pages with number of in-flight 512 Byte sectors, and this extra throttle triggered much earlier than intended. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
If we throttle resync because the socket sendbuffer is filling up, tell TCP about it, so it may expand the sendbuffer for us. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
Checksum based resync trades CPU cycles for network bandwidth, in situations where we expect much of the to-be-resynced blocks to be actually identical on both sides already. In a "network hickup" scenario, it won't help: all to-be-resynced blocks will typically be different. The use case is for the resync of *potentially* different blocks after crash recovery -- the crash recovery had marked larger areas (those covered by the activity log) as need-to-be-resynced, just in case. Most of those blocks will be identical. This option makes it possible to configure checksum based resync, but only actually use it for the first resync after primary crash. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
The last user was al_write_transaction, if called with "delegate", and the last user to call it with "delegate = true" was the receiver thread, which has no need to delegate, but can call it himself. Finally drop the delegate parameter, drop the extra w_al_write_transaction callback, and drop drbd_queue_work_front. Do not (yet) change dequeue_work_item to dequeue_work_batch, though. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
This replaces the md_sync_work member of struct drbd_device by a new MD_SYNC "work bit" in device->flags. This replaces the resync_start_work member of struct drbd_device by a new RS_START "work bit" in device->flags. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-
由 Lars Ellenberg 提交于
The recent fix to put_ldev() (correct ordering of access to local_cnt and state.disk; memory barrier in __drbd_set_state) guarantees that the cleanup happens exactly once. However it does not yet guarantee that the cleanup happens from worker context, the last put_ldev() may still happen from atomic context, which must not happen: blkdev_put() may sleep. Fix this by scheduling the cleanup to the worker instead, using a couple more bits in device->flags and a new helper, drbd_device_post_work(). Generalized the "resync progress" work to cover these new work bits. Signed-off-by: NPhilipp Reisner <philipp.reisner@linbit.com> Signed-off-by: NLars Ellenberg <lars.ellenberg@linbit.com>
-