- 03 9月, 2019 7 次提交
-
-
由 Kent Overstreet 提交于
The race was when a thread using closure_sync() notices cl->s->done == 1 before the thread calling closure_put() calls wake_up_process(). Then, it's possible for that thread to return and exit just before wake_up_process() is called - so we're trying to wake up a process that no longer exists. rcu_read_lock() is sufficient to protect against this, as there's an rcu barrier somewhere in the process teardown path. Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com> Acked-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dan Carpenter 提交于
The copy_to_user() function returns the number of bytes remaining to be copied, but the intention here was to return -EFAULT if the copy fails. Fixes: cafe5635 ("bcache: A block layer cache") Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Shile Zhang 提交于
Read /sys/fs/bcache/<uuid>/cacheN/priority_stats can take very long time with huge cache after long run. Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Tested-by: NHeitor Alves de Siqueira <halves@canonical.com> Signed-off-by: NColy Li <colyli@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Marcos Paulo de Souza 提交于
This argument was not being considered since blk-mq was set by default, so removed this documentation to avoid confusion. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMarcos Paulo de Souza <marcos.souza.org@gmail.com> .txt file is now .rst Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Marcos Paulo de Souza 提交于
This argument was ignored since blk-mq was set as default, so remove it from documentation. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMarcos Paulo de Souza <marcos.souza.org@gmail.com> .txt file is now .rst Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Marcos Paulo de Souza 提交于
Since the inclusion of blk-mq, elevator argument was not being considered anymore, and it's utility died long with the legacy IO path, now removed too. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NBob Liu <bob.liu@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMarcos Paulo de Souza <marcos.souza.org@gmail.com> Fold with doc removal patch. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Damien Le Moal 提交于
Commit 7211aef8 ("block: mq-deadline: Fix write completion handling") added a call to blk_mq_sched_mark_restart_hctx() in dd_dispatch_request() to make sure that write request dispatching does not stall when all target zones are locked. This fix left a subtle race when a write completion happens during a dispatch execution on another CPU: CPU 0: Dispatch CPU1: write completion dd_dispatch_request() lock(&dd->lock); ... lock(&dd->zone_lock); dd_finish_request() rq = find request lock(&dd->zone_lock); unlock(&dd->zone_lock); zone write unlock unlock(&dd->zone_lock); ... __blk_mq_free_request check restart flag (not set) -> queue not run ... if (!rq && have writes) blk_mq_sched_mark_restart_hctx() unlock(&dd->lock) Since the dispatch context finishes after the write request completion handling, marking the queue as needing a restart is not seen from __blk_mq_free_request() and blk_mq_sched_restart() not executed leading to the dispatch stall under 100% write workloads. Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from dd_dispatch_request() into dd_finish_request() under the zone lock to ensure full mutual exclusion between write request dispatch selection and zone unlock on write request completion. Fixes: 7211aef8 ("block: mq-deadline: Fix write completion handling") Cc: stable@vger.kernel.org Reported-by: NHans Holmberg <Hans.Holmberg@wdc.com> Reviewed-by: NHans Holmberg <hans.holmberg@wdc.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 31 8月, 2019 2 次提交
-
-
由 Tejun Heo 提交于
page->mapping may encode different values in it and page_mapping() should always be used to access the mapping pointer. track_foreign_dirty tracepoint was incorrectly accessing page->mapping directly. Use page_mapping() instead. Also, add NULL checks while at it. Fixes: 3a8e9ac8 ("writeback: add tracepoints for cgroup foreign writebacks") Reported-by: NJan Kara <jack@suse.cz> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
git://git.infradead.org/nvme由 Jens Axboe 提交于
Pull NVMe changes from Sagi: "The nvme updates include: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." * 'nvme-5.4' of git://git.infradead.org/nvme: (30 commits) nvme-rdma: Use rq_dma_dir macro nvme-fc: Use rq_dma_dir macro nvme-pci: Tidy up nvme_unmap_data nvme: make fabrics command run on a separate request queue nvme-pci: Support shared tags across queues for Apple 2018 controllers nvme-pci: Add support for Apple 2018+ models nvme-pci: Add support for variable IO SQ element size nvme-pci: Pass the queue to SQ_SIZE/CQ_SIZE macros nvme: trace bio completion nvme-multipath: fix ana log nsid lookup when nsid is not found nvmet-tcp: Add TOS for tcp transport nvme-tcp: Add TOS for tcp transport nvme-tcp: Use struct nvme_ctrl directly nvme-rdma: Add TOS for rdma transport nvme-fabrics: Add type of service (TOS) configuration nvmet-tcp: fix possible memory leak nvmet-tcp: fix possible NULL deref nvmet: trace: parse Get LBA Status command in detail nvme: trace: parse Get LBA Status command in detail nvme: trace: support for Get LBA Status opcode parsed ...
-
- 30 8月, 2019 31 次提交
-
-
由 Tejun Heo 提交于
cgroup foreign inode handling has quite a bit of heuristics and internal states which sometimes makes it difficult to understand what's going on. Add tracepoints to improve visibility. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
ioc_cpd_alloc() forgot to check NULL return from kzalloc(). Add it. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: Nkbuild test robot <lkp@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Israel Rukshin 提交于
Remove code duplication. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Israel Rukshin 提交于
Remove code duplication. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NJames Smart <james.smart@broadcom.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Israel Rukshin 提交于
Remove pointless local variable and use rq_dma_dir macro. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
We have a fundamental issue that fabric commands use the admin_q. The reason is, that admin-connect, register reads and writes and admin commands cannot be guaranteed ordering while we are running controller resets. For example, when we reset a controller we perform: 1. disable the controller 2. teardown the admin queue 3. re-establish the admin queue 4. enable the controller In order to perform (3), we need to unquiesce the admin queue, however we may have some admin commands that are already pending on the quiesced admin_q and will immediate execute when we unquiesce it before we execute (4). The host must not send admin commands to the controller before enabling the controller. To fix this, we have the fabric commands (admin connect and property get/set, but not I/O queue connect) use a separate fabrics_q and make sure to quiesce the admin_q before we disable the controller, and unquiesce it only after we enable the controller. This fixes the error prints from nvmet in a controller reset storm test: kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0 Which indicate that the host is sending an admin command when the controller is not enabled. Reviewed-by: NJames Smart <james.smart@broadcom.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Benjamin Herrenschmidt 提交于
Another issue with the Apple T2 based 2018 controllers seem to be that they blow up (and shut the machine down) if there's a tag collision between the IO queue and the Admin queue. My suspicion is that they use our tags for their internal tracking and don't mix them with the queue id. They also seem to not like when tags go beyond the IO queue depth, ie 128 tags. This adds a quirk that marks tags 0..31 of the IO queue reserved Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: NMing Lei <ming.lei@redhat.com> Acked-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Benjamin Herrenschmidt 提交于
Based on reverse engineering and original patch by Paul Pawlowski <paul@mrarm.io> This adds support for Apple weird implementation of NVME in their 2018 or later machines. It accounts for the twice-as-big SQ entries for the IO queues, and the fact that only interrupt vector 0 appears to function properly. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Benjamin Herrenschmidt 提交于
The size of a submission queue element should always be 6 (64 bytes) by spec. However some controllers such as Apple's are not properly implementing the standard and require a different size. This provides the ground work for the subsequent quirks for these controllers. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Benjamin Herrenschmidt 提交于
This will make it easier to handle variable queue entry sizes later. No functional change. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Hannes Reinecke 提交于
When native multipathing is enabled we cannot enable blktrace for the underlying paths, so any completion is never traced. Signed-off-by: NHannes Reinecke <hare@suse.com> [fixed-up by Mikhail for non-multipath-build] Signed-off-by: NMikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Anton Eidelman 提交于
ANA log parsing invokes nvme_update_ana_state() per ANA group desc. This updates the state of namespaces with nsids in desc->nsids[]. Both ctrl->namespaces list and desc->nsids[] array are sorted by nsid. Hence nvme_update_ana_state() performs a single walk over ctrl->namespaces: - if current namespace matches the current desc->nsids[n], this namespace is updated, and n is incremented. - the process stops when it encounters the end of either ctrl->namespaces end or desc->nsids[] In case desc->nsids[n] does not match any of ctrl->namespaces, the remaining nsids following desc->nsids[n] will not be updated. Such situation was considered abnormal and generated WARN_ON_ONCE. However ANA log MAY contain nsids not (yet) found in ctrl->namespaces. For example, lets consider the following scenario: - nvme0 exposes namespaces with nsids = [2, 3] to the host - a new namespace nsid = 1 is added dynamically - also, a ANA topology change is triggered - NS_CHANGED aen is generated and triggers scan_work - before scan_work discovers nsid=1 and creates a namespace, a NOTICE_ANA aen was issues and ana_work receives ANA log with nsids=[1, 2, 3] Result: ana_work fails to update ANA state on existing namespaces [2, 3] Solution: Change the way nvme_update_ana_state() namespace list walk checks the current namespace against desc->nsids[n] as follows: a) ns->head->ns_id < desc->nsids[n]: keep walking ctrl->namespaces. b) ns->head->ns_id == desc->nsids[n]: match, update the namespace c) ns->head->ns_id >= desc->nsids[n]: skip to desc->nsids[n+1] This enables correct operation in the scenario described above. This also allows ANA log to contain nsids currently invisible to the host, i.e. inactive nsids. Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com> Reviewed-by: NJames Smart <james.smart@broadcom.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Israel Rukshin 提交于
Set the outgoing packets type of service (TOS) according to the receiving TOS. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Suggested-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Israel Rukshin 提交于
TOS provide clients the ability to segregate traffic flows for different type of data. One of the TOS usage is bandwidth management which allows setting bandwidth limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A and 20% to controllers at QoS class B. usage examples: nvme connect --tos=0 --transport=tcp --traddr=10.0.1.1 --nqn=test-nvme Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Israel Rukshin 提交于
This patch doesn't change any functionality. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Israel Rukshin 提交于
For RDMA transports, TOS is an extension of IB QoS to provide clients the ability to segregate traffic flows for different type of data. RDMA CM abstract it for ULPs using rdma_set_service_type(). Internally, each traffic flow is represented by a connection with all of its independent resources like that of a normal connection, and is differentiated by service type. In other words, there can be multiple qp connections between an IP pair and each supports a unique service type. One of the TOS usage is bandwidth management which allows setting bandwidth limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A and 20% to controllers at QoS class B. Note: In addition to the TOS configuration, QOS must be configured on the relevant HCA on the target (send RDMA commands) and initiator to effect the traffic. usage examples: nvme connect --tos=0 --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Israel Rukshin 提交于
TOS is user-defined and needs to be configured via nvme-cli. It must be set before initiating any traffic and once set the TOS cannot be changed. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
when we uninit a command in error flow we also need to free an iovec if it was allocated. Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
We must only call sgl_free for sgl that we actually allocated. Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Minwoo Im 提交于
Four different fields are in CDWs of Get LBA Status command which means it would be great if we can see in detail when tracing in target side also. Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Minwoo Im 提交于
Four different fields are in CDWs of Get LBA Status command which means it would be great if we can see in detail when tracing. Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Minwoo Im 提交于
This patch adds Get LBA Status command's opcode to the macro that is used by the trace feature. Now we can see "get_lba_status" instead of the opcode value itself. Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Minwoo Im 提交于
NVMe 1.4 added Get LBA Status command with opcode 0x86. Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Tom Wu 提交于
In nvme spec 1.3 there is a definition for data write/read counters from SMART log, (See section 5.14.1.2): This value is reported in thousands (i.e., a value of 1 corresponds to 1000 units of 512 bytes read) and is rounded up. However, in nvme target where value is reported with actual units, but not thousands of units as the spec requires. Signed-off-by: NTom Wu <tomwu@mellanox.com> Reviewed-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
Simple polling support via socket busy_poll interface. Although we do not shutdown interrupts but simply hammer the socket poll, we can sometimes find completions faster than the normal interrupt driven RX path. We add per queue nr_cqe counter that resets every time RX path is invoked such that .poll callback can return it to stay consistent with the semantics. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Minwoo Im 提交于
The tcp host module is now taking those APIs from crypto ahash: (1) crypto_ahash_final() (2) crypto_ahash_digest() (3) crypto_alloc_ahash() nvme-tcp should depends on CRYPTO_CRC32C. Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@fb.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
All seem to call it with ctrl->cap so no need to pass it at all. Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
nvme_enable_ctrl reads the cap register right after, so no need to do that locally in the transport driver. Have sqsize setting in nvme_init_identify. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
Align with what the rest of the transports are doing. Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
No need to use a stack cap variable. Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Potnuri Bharat Teja 提交于
Using socket specific read_sock() calls instead of directly calling tcp_read_sock() helps lld module registered handlers if any, to be called from nvme-tcp host. This patch therefore replaces the tcp_read_sock() with socket specific prot_ops. Signed-off-by: NPotnuri Bharat Teja <bharat@chelsio.com> Acked-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-