- 30 September 2022, 2 commits
Committed by Jens Axboe
With end_io handlers now being able to potentially pass ownership of the request upon completion, we can allow requests with end_io handlers in the batch completion handling.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request, everything is converted to returning RQ_END_IO_NONE; there should be no functional changes with this patch.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
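For context, the return-value contract described above plausibly takes the shape of a small enum; RQ_END_IO_NONE is named in the commit message, while the second value, its name, and the typedef are illustrative assumptions, not quoted from the patch.

	/* Sketch only: RQ_END_IO_NONE comes from the commit message; the
	 * second value and the typedef are assumptions for illustration.
	 */
	enum rq_end_io_ret {
		RQ_END_IO_NONE,	/* handler done, caller still owns the request */
		RQ_END_IO_FREE,	/* ownership passed back; block layer frees it */
	};

	typedef enum rq_end_io_ret (rq_end_io_fn)(struct request *, blk_status_t);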
-
- 29 September 2022, 1 commit
Committed by Pankaj Raghav
The current implementation of blk_mq_plug() disables plugging for all operations that involve a transfer to the device, since it only checks the lowest bit of the opcode via op_is_write(). Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and REQ_OP_WRITE_ZEROES, as those are the operations that might require a zone lock.

Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220929074745.103073-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
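A hedged sketch of the narrowed check; the helper name is hypothetical and the two opcodes come from the commit message, so this is the idea rather than the patch itself.

	/* Sketch: only writes that can take a zone write lock should bypass
	 * plugging. Helper name is hypothetical; opcodes are from the message.
	 */
	static inline bool op_may_need_zone_lock(enum req_op op)
	{
		return op == REQ_OP_WRITE || op == REQ_OP_WRITE_ZEROES;
	}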
-
- 27 September 2022, 3 commits
Committed by Christoph Hellwig
Replace blk_queue_nowait with a bdev_nowait helper that takes the block_device, given that the I/O submission path should not have to look into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
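The helper plausibly reduces to a one-line wrapper; a sketch under the assumption that the nowait state still lives in a queue flag and that bdev_get_queue() remains the accessor.

	/* Sketch: only the bdev-based entry point is new; the flag test is
	 * assumed to be unchanged underneath.
	 */
	static inline bool bdev_nowait(struct block_device *bdev)
	{
		return test_bit(QUEUE_FLAG_NOWAIT, &bdev_get_queue(bdev)->queue_flags);
	}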
-
Committed by Christoph Hellwig
Mark them as unsigned so that we don't need extra casts, and define them relative to cdword0 instead of requiring extra shifts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
-
Committed by Christoph Hellwig
Pass the gendisk to blkcg_schedule_throttle as part of moving the blk-cgroup infrastructure to be gendisk based. Remove the unused !BLK_CGROUP stub while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 September 2022, 3 commits
Committed by ZiyangZhang
START_USER_RECOVERY and END_USER_RECOVERY are two new control commands to support the user recovery feature.

After a crash, the user should send START_USER_RECOVERY, which will:
(1) check that (a) the current ublk_device is UBLK_S_DEV_QUIESCED, which was set by quiesce_work, and (b) the chardev is released;
(2) reinit all ubqs, including: (a) putting the task_struct and resetting ->ubq_daemon to NULL, (b) resetting all ublk_io;
(3) reset ub->mm to NULL.

Then the user should start a new process and send FETCH_REQ on each ubq_daemon. Finally, the user should send END_USER_RECOVERY, which will:
(1) wait for all new ubq_daemons to get ready;
(2) update ublksrv_pid;
(3) unquiesce the request queue and expect incoming ublk_queue_rq();
(4) convert ub's state to UBLK_S_DEV_LIVE.

Note: we can handle STOP_DEV between START_USER_RECOVERY and END_USER_RECOVERY. This is helpful for users who cannot start a new process after sending the START_USER_RECOVERY ctrl-cmd.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-7-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by ZiyangZhang
UBLK_F_USER_RECOVERY_REISSUE implies that, with a dying ubq_daemon, ublk_drv lets monitor_work requeue rqs that were issued to userspace (ublksrv) before the ubq_daemon started dying.

UBLK_F_USER_RECOVERY_REISSUE is designed for backends that:
(1) tolerate double writes, since ublk_drv may issue the same rq twice;
(2) must not let frontend users see I/O errors, such as a read-only FS or a VM backend.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-6-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by ZiyangZhang
Define some macros for the recovery feature.

UBLK_S_DEV_QUIESCED implies that the ublk_device is quiesced and ready for recovery. This state can be observed by userspace.

UBLK_F_USER_RECOVERY implies that:
(1) ublk_drv enables the recovery feature; it won't let monitor_work automatically abort rqs and release the device;
(2) with a dying ubq_daemon, ublk_drv ends (aborts) rqs issued to userspace (ublksrv) before the crash;
(3) with a dying ubq_daemon, ublk_drv requeues rqs in task work and ublk_queue_rq().

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-3-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 23 September 2022, 1 commit
Committed by Will Deacon
Due to undocumented, hysterical raisins on x86, the CFI jump-table sections in .text are needlessly aligned to PMD_SIZE in the vmlinux linker script. When compiling a CFI-enabled arm64 kernel with a 64KiB page-size, a PMD maps 512MiB of virtual memory and so the .text section increases to a whopping 940MiB and blows the final Image up to 960MiB. Others report a link failure.

Since the CFI jump-table requires only instruction alignment, reduce the alignment directives to function alignment for parity with other parts of the .text section. This reduces the size of the .text section for the aforementioned 64KiB page size arm64 kernel to 19MiB, for a much more reasonable total Image size of 39MiB.

Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Mohan Rao .vanimina" <mailtoc.mohanrao@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/CAL_GTzigiNOMYkOPX1KDnagPhJtFNqSK=1USNbS0wUL4PW6-Uw@mail.gmail.com/
Fixes: cf68fffb ("add support for Clang CFI")
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220922215715.13345-1-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
-
- 22 September 2022, 10 commits
Committed by Phil Auld
As PAGE_SIZE is unsigned long, -1 > PAGE_SIZE when NR_CPUS <= 3. This leads to very large file sizes:

topology$ ls -l
total 0
-r--r--r-- 1 root root 18446744073709551615 Sep  5 11:59 core_cpus
-r--r--r-- 1 root root 4096 Sep  5 11:59 core_cpus_list
-r--r--r-- 1 root root 4096 Sep  5 10:58 core_id
-r--r--r-- 1 root root 18446744073709551615 Sep  5 10:10 core_siblings
-r--r--r-- 1 root root 4096 Sep  5 11:59 core_siblings_list
-r--r--r-- 1 root root 18446744073709551615 Sep  5 11:59 die_cpus
-r--r--r-- 1 root root 4096 Sep  5 11:59 die_cpus_list
-r--r--r-- 1 root root 4096 Sep  5 11:59 die_id
-r--r--r-- 1 root root 18446744073709551615 Sep  5 11:59 package_cpus
-r--r--r-- 1 root root 4096 Sep  5 11:59 package_cpus_list
-r--r--r-- 1 root root 4096 Sep  5 10:58 physical_package_id
-r--r--r-- 1 root root 18446744073709551615 Sep  5 10:10 thread_siblings
-r--r--r-- 1 root root 4096 Sep  5 11:59 thread_siblings_list

Adjust the inequality to catch the case when NR_CPUS is configured to a small value.

Fixes: 7ee951ac ("drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist")
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: stable@vger.kernel.org
Cc: feng xiangjun <fengxj325@gmail.com>
Reported-by: feng xiangjun <fengxj325@gmail.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20220906203542.1796629-1-pauld@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
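The underlying pitfall is the usual arithmetic conversions: a negative signed value compared against an unsigned long is first converted to unsigned and becomes ULONG_MAX, exactly the 18446744073709551615 seen in the listing above. A self-contained userspace illustration of the pitfall, not the driver code itself:

	#include <stdio.h>

	int main(void)
	{
		unsigned long page_size = 4096;	/* PAGE_SIZE is unsigned long */
		long size = -1;			/* a size that went negative */

		/* 'size' converts to unsigned long, i.e. ULONG_MAX, so this is true */
		if (size > page_size)
			printf("%ld compares as %lu, which is > %lu\n",
			       size, (unsigned long)size, page_size);
		return 0;
	}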
-
Committed by Pavel Begunkov
Add a zerocopy version of sendmsg.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6aabc4bdfc0ec78df6ec9328137e394af9d4e7ef.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We need the poll_flags to know how to poll for the IO, and we should have the batch structure in preparation for supporting batched completions with iopoll.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Kanchan Joshi
This is in preparation to support iopoll for nvme passthrough.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-4-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Kanchan Joshi
Wire this up in the same way iopoll is done for regular read/write IO. Make room for storing a cookie in struct io_uring_cmd on submission. Perform the completion using the ->uring_cmd_iopoll handler.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Kanchan Joshi
io_uring will invoke this to do completion polling on uring-cmd operations.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-2-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken
Add tracing for io_run_local_task_work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken
Some workloads rely on a registered eventfd (via io_uring_register_eventfd(3)) in order to wake up and process the io_uring. In the case of a ring setup with IORING_SETUP_DEFER_TASKRUN, that eventfd also needs to be signalled when there are tasks to run.

This changes an old behaviour which assumed 1 eventfd signal implied at least 1 CQE, however only when this new flag is set (and so old users will not notice). This should be expected with the IORING_SETUP_DEFER_TASKRUN flag, as it is not guaranteed that every task will result in a CQE.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-7-dylany@fb.com
[axboe: fold in call_rcu() serialization fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken
Allow deferring async tasks until the user calls io_uring_enter(2) with the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at io_uring_setup time. This functionality requires that the later io_uring_enter will be called from the same submission task, and therefore restricts this flag to work only when IORING_SETUP_SINGLE_ISSUER is also set.

Being able to hand-pick when tasks are run prevents the problem where there is current work to be done, but task work runs anyway. For example, a common workload would obtain a batch of CQEs and process each one. Interrupting this with additional task work would add latency but not gain anything. If instead task work is deferred to just before more CQEs are obtained, then no additional latency is added.

The way this is implemented is by trying to keep task work local to an io_ring_ctx, rather than to the submission task. This is required, as the application will want to wake up only a single io_ring_ctx at a time to process work, and so the lists of work have to be kept separate. This has some other benefits, like not having to check the task continually in handle_tw_list (and potentially unlocking/locking those), and reducing locks in the submit & process completions path.

There are networking cases where using this option can reduce request latency by 50%. For example, a contrived example using [1], where the client sends 2k data and receives the same data back while doing some system calls (to trigger task work), shows this reduction. The reason ends up being that if sending responses is delayed by processing task work, then the client side sits idle. Whereas reordering the sends first means that the client runs its workload in parallel with the local task work.

[1]: Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100"

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
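From userspace, a hedged usage sketch with liburing, assuming a liburing and kernel new enough to expose both setup flags; the flag names come from the commit message, the rest is a plain liburing skeleton.

	#include <liburing.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_params p = { 0 };

		/* DEFER_TASKRUN requires SINGLE_ISSUER per the commit message */
		p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
		if (io_uring_queue_init_params(8, &ring, &p) < 0)
			return 1;	/* kernel too old, or setup failed */

		/* ... prepare and queue SQEs here ... */

		/* deferred task work runs when we enter the kernel to reap CQEs */
		io_uring_submit_and_wait(&ring, 1);

		io_uring_queue_exit(&ring);
		return 0;
	}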
-
Committed by Dylan Yudaken
Guard wakeups that the user can trigger, and that may end up triggering a call back into eventfd_signal. This is in addition to the current approach that only guards in eventfd_signal. Rename in_eventfd_signal -> in_eventfd at the same time to reflect this.

Without this there would be a deadlock in the following code using libaio:

	int main()
	{
		struct io_context *ctx = NULL;
		struct iocb iocb;
		struct iocb *iocbs[] = { &iocb };
		int evfd;
		uint64_t val = 1;

		evfd = eventfd(0, EFD_CLOEXEC);
		assert(!io_setup(2, &ctx));
		io_prep_poll(&iocb, evfd, POLLIN);
		io_set_eventfd(&iocb, evfd);
		assert(1 == io_submit(ctx, 1, iocbs));
		write(evfd, &val, 8);
	}

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220816135959.1490641-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 September 2022, 2 commits
Committed by Bart Van Assche
The documentation of the blk_eh_timer_return enumeration values does not reflect correctly how e.g. the SCSI core uses these values. Fix the documentation.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 88b0cfad ("block: document the blk_eh_timer_return values")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220920200626.3422296-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Lu Baolu
This reverts commit 9cd4f143.

Some issues were reported on the original commit. Some thunderbolt devices don't work anymore due to the following DMA fault:

DMAR: DRHD: handling fault status reg 2
DMAR: [INTR-REMAP] Request device [09:00.0] fault index 0x8080 [fault reason 0x25] Blocked a compatibility format interrupt request

Bring it back for now to avoid functional regression.

Fixes: 9cd4f143 ("iommu/vt-d: Fix possible recursive locking in intel_iommu_init()")
Link: https://lore.kernel.org/linux-iommu/485A6EA5-6D58-42EA-B298-8571E97422DE@getmailspring.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216497
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: <stable@vger.kernel.org> # 5.19.x
Reported-and-tested-by: George Hilliard <thirtythreeforty@gmail.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20220920081701.3453504-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
-
- 20 September 2022, 2 commits
Committed by Christoph Hellwig
PSI accounting is now done by the VM code, where it should have been since the beginning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Christoph Hellwig
PSI tries to account for the cost of bringing back in pages discarded by the MM LRU management. Currently the prime place for that is hooked into the bio submission path, which is a rather bad place:

- it does not actually account I/O for non-block file systems, of which we have many
- it adds overhead and a layering violation to the block layer

Add the accounting into the two places in the core MM code that read pages into an address space by calling into ->read_folio and ->readahead so that the entire file system operations are covered, to broaden the coverage and allow removing the accounting in the block layer going forward.

As psi_memstall_enter can deal with nested calls this will not lead to double accounting even while the bio annotations are still present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
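The call-site shape this describes is a memstall bracket around the read; psi_memstall_enter/leave are the existing PSI API, while the wrapper below and its surrounding details are assumptions for illustration.

	/* Sketch only: wraps a ->read_folio call in PSI memstall accounting;
	 * the function and variable names here are assumed, not from the patch.
	 */
	static int read_folio_accounted(struct file *file, struct folio *folio)
	{
		struct address_space *mapping = folio->mapping;
		unsigned long pflags;
		int error;

		psi_memstall_enter(&pflags);	/* a refault counts as a memory stall */
		error = mapping->a_ops->read_folio(file, folio);
		psi_memstall_leave(&pflags);
		return error;
	}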
-
- 16 September 2022, 2 commits
Committed by Benjamin Poirier
There are already a few definitions of arrays containing MULTICAST_LACPDU_ADDR and the next patch will add one more use. These all contain the same constant data so define one common instance for all bonding code.

Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Haimin Zhang
There is an uninit-value bug in the dgram_sendmsg function in net/ieee802154/socket.c when the length of the valid data pointed to by msg->msg_name isn't verified. Introduce a helper function, ieee802154_sockaddr_check_size, to check namelen: first check that there is an addr_type in ieee802154_addr_sa, then check namelen according to the addr_type. The same check is also applied in raw_bind, dgram_bind, and dgram_connect.

Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
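A hedged sketch of such a validation helper; the *_NAMELEN constants below are hypothetical placeholders, and only the overall flow (check addr_type first, then the per-type minimum length) comes from the commit message.

	/* Sketch: validate namelen before trusting any field beyond addr_type.
	 * The IEEE802154_*_NAMELEN constants are hypothetical placeholders.
	 */
	static int ieee802154_sockaddr_check_size(struct sockaddr_ieee802154 *daddr,
						  int len)
	{
		/* enough bytes to read addr_type at all? */
		if (len < IEEE802154_MIN_NAMELEN)
			return -EINVAL;

		switch (daddr->addr.addr_type) {
		case IEEE802154_ADDR_SHORT:
			return len < IEEE802154_NAMELEN_SHORT ? -EINVAL : 0;
		case IEEE802154_ADDR_LONG:
			return len < IEEE802154_NAMELEN_LONG ? -EINVAL : 0;
		default:
			return -EINVAL;
		}
	}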
-
- 13 September 2022, 1 commit
Committed by Jiangshan Yi
Fix spelling typo in comment.

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Signed-off-by: Helge Deller <deller@gmx.de>
-
- 12 September 2022, 2 commits
Committed by Yu Kuai
Test scripts:

	cd /sys/fs/cgroup/blkio/
	echo "8:0 1024" > blkio.throttle.write_bps_device
	echo $$ > cgroup.procs
	dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
	dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &

Test result:

	10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s
	10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s

The problem is that the second bio is finished after 10s instead of 20s.

Root cause:
1) The second bio will be flagged:

	__blk_throtl_bio
		while (true) {
			...
			if (sq->nr_queued[rw]) -> some bio is throttled already
				break
		};
		bio_set_flag(bio, BIO_THROTTLED); -> flag the bio

2) The flagged bio will be dispatched without waiting:

	throtl_dispatch_tg
		tg_may_dispatch
			tg_with_in_bps_limit
				if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED))
					*wait = 0; -> wait time is zero
					return true;

Commit 9f5ede3c ("block: throttle split bio in case of iops limit") added support for counting split bios toward the iops limit, and thus added flagged-bio checking in tg_with_in_bps_limit() so that split bios would only be counted once for the bps limit. However, it introduced a new problem: io throttling won't work if multiple bios are throttled.

In order to fix the problem, handle the iops and bps limits in different ways:
1) for the iops limit, there is no flag to record whether the bio is throttled, and the iops limit is always applied;
2) for the bps limit, the original bio will be flagged with BIO_BPS_THROTTLED, and io throttling will ignore bios with the flag.

Note that this patch also removes the code that sets the flag in __bio_clone(); it was introduced in commit 111be883 ("block-throttle: avoid double charge"), whose author assumed a split bio could be resubmitted and throttled again, which is wrong because a split bio will continue to be dispatched from the caller.

Fixes: 9f5ede3c ("block: throttle split bio in case of iops limit")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Keith Busch
Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 September 2022, 1 commit
Committed by Lu Baolu
The global rwsem dmar_global_lock was introduced by commit 3a5670e8 ("iommu/vt-d: Introduce a rwsem to protect global data structures"). It is used to protect DMAR related global data from DMAR hotplug operations.

The dmar_global_lock used in intel_iommu_init() might cause a recursive locking issue; for example, intel_iommu_get_resv_regions() takes the dmar_global_lock from within a section where intel_iommu_init() already holds it via probe_acpi_namespace_devices().

The use of dmar_global_lock in intel_iommu_init() can be relaxed, since it is unlikely that any IO board must be hot added before the IOMMU subsystem is initialized. This eliminates the possible recursive locking issue by moving DMAR hotplug support down to after the IOMMU is initialized and removing the uses of dmar_global_lock in intel_iommu_init().

Fixes: d5692d4a ("iommu/vt-d: Fix suspicious RCU usage in probe_acpi_namespace_devices()")
Reported-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/894db0ccae854b35c73814485569b634237b5538.1657034828.git.robin.murphy@arm.com
Link: https://lore.kernel.org/r/20220718235325.3952426-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
-
- 10 September 2022, 1 commit
Committed by Luiz Augusto von Dentz
Recent changes break HCIGETDEVINFO since they change the size of hci_dev_info.

Fixes: 26afbd82 ("Bluetooth: Add initial implementation of CIS connections")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
-
- 08 September 2022, 1 commit
Committed by Linus Torvalds
Commit d4252071 ("add barriers to buffer_uptodate and set_buffer_uptodate") added proper memory barriers to the buffer head BH_Uptodate bit, so that anybody who tests a buffer for being up-to-date will be guaranteed to actually see initialized state.

However, that commit didn't _just_ add the memory barrier, it also ended up dropping the "was it already set" logic that the BUFFER_FNS() macro had.

That's conceptually the right thing for a generic "this is a memory barrier" operation, but in the case of the buffer contents, we really only care about the memory barrier for the _first_ time we set the bit, in that the only memory ordering protection we need is to avoid anybody seeing uninitialized memory contents. Any other access ordering wouldn't be about the BH_Uptodate bit anyway, and would require some other proper lock (typically BH_Lock or the folio lock). A reader that races with somebody invalidating the buffer head isn't an issue wrt the memory ordering, it's a serialization issue.

Now, you'd think that the buffer head operations don't matter in this day and age (and I certainly thought so), but apparently some loads still end up being heavy users of buffer heads. In particular, the kernel test robot reported that not having this bit access optimization in place caused a noticeable direct IO performance regression on ext4:

	fxmark.ssd_ext4_no_jnl_DWTL_54_directio.works/sec -26.5% regression

although you presumably need a fast disk and a lot of cores to actually notice.

Link: https://lore.kernel.org/all/Yw8L7HTZ%2FdE2%2Fo9C@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Fengwei Yin <fengwei.yin@intel.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
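The restored fast path plausibly looks like the sketch below, built from the standard bitops helpers; this is an illustration of the "barrier only on the first transition" idea, not a verbatim copy of the patch.

	/* Sketch: only the 0 -> 1 transition needs the publishing barrier;
	 * if the bit is already set, a previous setter provided the ordering.
	 */
	static inline void set_buffer_uptodate_sketch(struct buffer_head *bh)
	{
		if (test_bit(BH_Uptodate, &bh->b_state))
			return;	/* already set: skip the barrier and the atomic */

		smp_mb__before_atomic();	/* order data writes before the bit */
		set_bit(BH_Uptodate, &bh->b_state);
	}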
-
- 07 September 2022, 3 commits
Committed by Ilpo Järvinen
A very common pattern in the drivers is to advance the xmit tail index and do bookkeeping of Tx'ed characters. Create uart_xmit_advance() to handle it.

Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20220901143934.8850-2-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
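Given that description, the helper plausibly looks like the sketch below; the field names are assumed from the conventional serial core layout (circ_buf xmit ring plus the tx icount), not copied from the patch.

	/* Sketch: advance the circular xmit buffer tail by 'chars' and count
	 * them as transmitted. Layout of uart_port/circ_buf is assumed.
	 */
	static inline void uart_xmit_advance(struct uart_port *up, unsigned int chars)
	{
		struct circ_buf *xmit = &up->state->xmit;

		xmit->tail = (xmit->tail + chars) & (UART_XMIT_SIZE - 1);
		up->icount.tx += chars;
	}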
-
Committed by Menglong Dong
As Eric reported, the 'reason' field is not presented when tracing the kfree_skb event with perf:

	$ perf record -e skb:kfree_skb -a sleep 10
	$ perf script
	ip_defrag 14605 [021] 221.614303: skb:kfree_skb: skbaddr=0xffff9d2851242700 protocol=34525 location=0xffffffffa39346b1 reason:

The cause seems to be passing a kernel address directly to TP_printk(), which is not right. As the enum 'skb_drop_reason' is not exported to user space through TRACE_DEFINE_ENUM(), perf can't get the drop reason string from the 'reason' field, which is a number.

Therefore, we introduce the macro DEFINE_DROP_REASON(), which is used to define the trace enum by TRACE_DEFINE_ENUM(). With the help of DEFINE_DROP_REASON(), we can now remove the auto-generation that we introduced in commit ec43908d ("net: skb: use auto-generation to convert skb drop reason to string"), and define the string array 'drop_reasons'.

Hmmmm...now we come back to the situation of having to maintain drop reasons in both enum skb_drop_reason and DEFINE_DROP_REASON. But they are both in dropreason.h, which makes it easier.

After this commit, the format of kfree_skb looks like this:

	$ cat /tracing/events/skb/kfree_skb/format
	name: kfree_skb
	ID: 1524
	format:
		field:unsigned short common_type; offset:0; size:2; signed:0;
		field:unsigned char common_flags; offset:2; size:1; signed:0;
		field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
		field:int common_pid; offset:4; size:4; signed:1;
		field:void * skbaddr; offset:8; size:8; signed:0;
		field:void * location; offset:16; size:8; signed:0;
		field:unsigned short protocol; offset:24; size:2; signed:0;
		field:enum skb_drop_reason reason; offset:28; size:4; signed:0;
	print fmt: "skbaddr=%p protocol=%u location=%p reason: %s", REC->skbaddr, REC->protocol, REC->location, __print_symbolic(REC->reason, { 1, "NOT_SPECIFIED" }, { 2, "NO_SOCKET" } ......

Fixes: ec43908d ("net: skb: use auto-generation to convert skb drop reason to string")
Link: https://lore.kernel.org/netdev/CANn89i+bx0ybvE55iMYf5GJM48WwV1HNpdm9Q6t-HaEstqpCSA@mail.gmail.com/
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
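The X-macro pattern described above might look roughly like the sketch below; the entries are abbreviated to three for illustration, and the exact expansion helpers are assumed rather than quoted from dropreason.h.

	/* Sketch of the X-macro idea: one list expands into the enum, the
	 * TRACE_DEFINE_ENUM() calls, and the string table. Entries abbreviated.
	 */
	#define DEFINE_DROP_REASON(FN, FNe)	\
		FN(NOT_SPECIFIED)		\
		FN(NO_SOCKET)			\
		FNe(PKT_TOO_SMALL)

	/* e.g. expand it into the string array: */
	#define FN(reason) [SKB_DROP_REASON_##reason] = #reason,
	#define FNe(reason) [SKB_DROP_REASON_##reason] = #reason

	const char * const drop_reasons[] = {
		DEFINE_DROP_REASON(FN, FNe)
	};
	#undef FN
	#undef FNe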
-
Committed by Christoph Hellwig
Now that the remaining users in drivers are gone, this function can be marked static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 06 September 2022, 1 commit
Committed by Vitaly Kuznetsov
There are already three places in kernel which define PCI_VENDOR_ID_MICROSOFT and two for PCI_DEVICE_ID_HYPERV_VIDEO, and there's a need to use these from core VMBus code. Move the defines where they belong.

No functional change.

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20220827130345.1320254-2-vkuznets@redhat.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
-
- 05 September 2022, 3 commits
Remove the CONFIG_PREEMPT_RT symbol from the ifdef around do_softirq_own_stack() and move it to Kconfig instead. Enable softirq stacks based on SOFTIRQ_ON_OWN_STACK, which depends on HAVE_SOFTIRQ_ON_OWN_STACK and whose default value is set to !PREEMPT_RT. This ensures that softirq stacks are not used on PREEMPT_RT and avoids a 'select' statement on an option which has a 'depends' statement.

Link: https://lore.kernel.org/YvN5E%2FPrHfUhggr7@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-
Committed by Maher Sanalla
When the RDMA auxiliary driver probes, it sets its profile based on the devlink driverinit value. The latter might not be in sync with FW yet (in case devlink reload is not performed), thus causing a mismatch between the RDMA driver and FW. This results in the following FW syndrome when the RDMA driver tries to adjust the RoCE state, which fails the probe:

	"0xC1F678 | modify_nic_vport_context: roce_en set on a vport that doesn't support roce"

To prevent this, select the PF profile based on the FW RoCE capability instead of relying on the devlink driverinit value. To provide backward compatibility of the RoCE disable feature, on older FWs where roce_rw is not set (the FW RoCE capability is read-only), keep the current behavior, e.g. rely on the devlink driverinit value.

Fixes: fbfa97b4 ("net/mlx5: Disable roce at HCA level")
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://lore.kernel.org/r/cb34ce9a1df4a24c135cb804db87f7d2418bd6cc.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
-
Committed by Greg Kroah-Hartman
There is a very common pattern of using debugfs_remove(debugfs_lookup(...)), which results in a dentry leak of the dentry that was looked up. Instead of having to open-code the correct pattern of calling dput() on the dentry, create debugfs_lookup_and_remove() to handle this pattern automatically and properly, without any memory leaks.

Cc: stable <stable@kernel.org>
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lore.kernel.org/r/YxIaQ8cSinDR881k@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
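The fix plausibly boils down to pairing the lookup with a dput(); a sketch under that assumption, with the shape inferred from the commit message rather than copied from the patch:

	/* Sketch: look up by name, remove, then drop the reference the
	 * lookup took, the step the open-coded pattern forgets.
	 */
	void debugfs_lookup_and_remove(const char *name, struct dentry *parent)
	{
		struct dentry *dentry;

		dentry = debugfs_lookup(name, parent);
		if (!dentry)
			return;

		debugfs_remove(dentry);
		dput(dentry);	/* balance the reference from debugfs_lookup() */
	}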
-
- 04 September 2022, 1 commit
Committed by Jens Axboe
This reverts commit 16ede669.

This is causing issues with CPU stalls on my test box, revert it for now until we understand what is going on. It looks like infinite looping off sbitmap_queue_wake_up(), but hard to tell with a lot of CPUs hitting this issue and the console scrolling infinitely.

Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-