Commits · 9f08cf088676c12a5b53bd5a29cf04f00c787b5d · openeuler / Kernel

01 Dec, 2019 2 commits

rss_stat: add support to detect RSS updates of external mm · e4dcad20

Authored by Joel Fernandes (Google) on Nov 30, 2019

When a process updates the RSS of a different process, the rss_stat
tracepoint appears in the context of the process doing the update.  This
can mislead userspace into thinking that the RSS of the process doing the
update changed, while in reality a different process's RSS was updated.

This issue happens in reclaim paths such as with direct reclaim or
background reclaim.

This patch adds more information to the tracepoint about whether the mm
being updated belongs to the current process's context (curr field).  We
also include a hash of the mm pointer so that the process who the mm
belongs to can be uniquely identified (mm_id field).

Also vsprintf.c is refactored a bit to allow reuse of hashing code.
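
The two new fields can be illustrated with a minimal user-space sketch. All names here (struct mm_stub, struct rss_event, hash_mm_id, fill_rss_event) are hypothetical, and the mixer is only a stand-in for the kernel's keyed pointer hash, which this sketch does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's mm_struct. */
struct mm_stub {
    long rss[4];
};

/* Hypothetical event record carrying the two fields the patch describes. */
struct rss_event {
    uint32_t mm_id; /* hash of the mm pointer, never the raw address */
    bool curr;      /* true when the updated mm is the caller's own */
    int member;
    long size;
};

/* Stand-in mixer (splitmix64 finalizer), NOT the kernel's pointer hash. */
static uint32_t hash_mm_id(const void *ptr, uint64_t key)
{
    uint64_t x = (uint64_t)(uintptr_t)ptr ^ key;

    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return (uint32_t)x;
}

/* Build the event for an update of 'updated' made while 'current_mm' runs. */
static struct rss_event fill_rss_event(const struct mm_stub *updated,
                                       const struct mm_stub *current_mm,
                                       int member, uint64_t key)
{
    struct rss_event ev;

    assert(updated != NULL);
    assert(member >= 0 && member < 4);
    ev.mm_id = hash_mm_id(updated, key);
    ev.curr = (updated == current_mm);
    ev.member = member;
    ev.size = updated->rss[member];
    return ev;
}
```

A consumer of such events would group them by mm_id and treat curr == false as a sign that the update came from another context, such as reclaim.
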

[akpm@linux-foundation.org: remove unused local `str']
[joelaf@google.com: inline call to ptr_to_hashval]
  Link: http://lore.kernel.org/r/20191113153816.14b95acd@gandalf.local.home
  Link: http://lkml.kernel.org/r/20191114164622.GC233237@google.com
Link: http://lkml.kernel.org/r/20191106024452.81923-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Ioannis Ilkos <ilkos@google.com>
Acked-by: Petr Mladek <pmladek@suse.com>	[lib/vsprintf.c]
Cc: Tim Murray <timmurray@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Carmen Jackson <carmenjackson@google.com>
Cc: Mayank Gupta <mayankgupta@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: emit tracepoint when RSS changes · b3d1411b

Authored by Joel Fernandes (Google) on Nov 30, 2019

Useful to track how RSS is changing per TGID to detect spikes in RSS and
memory hogs.  Several Android teams have been using this patch in
various kernel trees for half a year now.  Many reported to me it is
really useful so I'm posting it upstream.

Initial patch developed by Tim Murray.  Changes I made from original
patch:
 o Prevent any additional space consumed by mm_struct.

Regarding the fact that the RSS may change too often thus flooding the
traces - note that, there is some "hysteresis" with this already.  That
is - We update the counter only if we receive 64 page faults due to
SPLIT_RSS_ACCOUNTING.  However, during zapping or copying of pte range,
the RSS is updated immediately which can become noisy/flooding.  In a
previous discussion, we agreed that BPF or ftrace can be used to rate
limit the signal if this becomes an issue.
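
The batching described above can be sketched in user space as follows. This is only an illustration under stated assumptions: struct mm_counter, struct task_cache and the function names are hypothetical, the trace hook is replaced by a counter, and only the threshold of 64 comes from the text.

```c
#include <assert.h>
#include <stddef.h>

#define RSS_EVENTS_THRESH 64 /* batch size quoted in the commit message */

/* Hypothetical shared counter plus a count of emitted trace events. */
struct mm_counter {
    long rss;
    unsigned long traces_emitted;
};

/* Hypothetical per-task cache that absorbs page-fault updates. */
struct task_cache {
    long pending;
    int events;
};

/* Stand-in for the trace hook: here it only counts invocations. */
static void emit_rss_trace(struct mm_counter *mm)
{
    mm->traces_emitted++;
}

/* Fold the cached delta into the shared counter and trace once. */
static void sync_rss(struct mm_counter *mm, struct task_cache *tc)
{
    if (tc->pending != 0) {
        mm->rss += tc->pending;
        tc->pending = 0;
        emit_rss_trace(mm);
    }
    tc->events = 0;
}

/* Fault path: only the per-task cache is touched until the threshold. */
static void account_fault(struct mm_counter *mm, struct task_cache *tc,
                          long pages)
{
    assert(mm != NULL && tc != NULL);
    tc->pending += pages;
    if (++tc->events >= RSS_EVENTS_THRESH)
        sync_rss(mm, tc);
}

/* Zap/copy path: the shared counter is updated, and traced, immediately. */
static void account_direct(struct mm_counter *mm, long pages)
{
    assert(mm != NULL);
    mm->rss += pages;
    emit_rss_trace(mm);
}
```

In this sketch 64 single-page faults produce one trace event, while each direct update produces its own, which is the noisy case the message suggests rate limiting with BPF or ftrace.
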

Also note that I added wrappers to trace_rss_stat to prevent compiler
errors where linux/mm.h is included from tracing code, causing errors
such as:

    CC      kernel/trace/power-traces.o
  In file included from ./include/trace/define_trace.h:102,
                   from ./include/trace/events/kmem.h:342,
                   from ./include/linux/mm.h:31,
                   from ./include/linux/ring_buffer.h:5,
                   from ./include/linux/trace_events.h:6,
                   from ./include/trace/events/power.h:12,
                   from kernel/trace/power-traces.c:15:
  ./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type
     struct trace_entry ent;    \

Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org
Link: http://lkml.kernel.org/r/20191001172817.234886-1-joel@joelfernandes.org
Co-developed-by: Tim Murray <timmurray@google.com>
Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Carmen Jackson <carmenjackson@google.com>
Cc: Mayank Gupta <mayankgupta@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

26 Nov, 2019 1 commit

io_uring: improve trace_io_uring_defer() trace point · 915967f6

Authored by Jens Axboe on Nov 21, 2019

We don't have shadow requests anymore, so get rid of the shadow
argument. Add the user_data argument, as that's often useful to easily
match up requests, instead of having to look at request pointers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

25 Nov, 2019 1 commit

writeback: fix -Wformat compilation warnings · 40363cf1

Authored by Qian Cai on Nov 14, 2019

Commit f05499a0 ("writeback: use ino_t for inodes in tracepoints")
introduced a lot of GCC compilation warnings on s390:

In file included from ./include/trace/define_trace.h:102,
                 from ./include/trace/events/writeback.h:904,
                 from fs/fs-writeback.c:82:
./include/trace/events/writeback.h: In function
'trace_raw_output_writeback_page_template':
./include/trace/events/writeback.h:76:12: warning: format '%lu' expects
argument of type 'long unsigned int', but argument 4 has type 'ino_t'
{aka 'unsigned int'} [-Wformat=]
  TP_printk("bdi %s: ino=%lu index=%lu",
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/trace/trace_events.h:360:22: note: in definition of macro
'DECLARE_EVENT_CLASS'
  trace_seq_printf(s, print);     \
                      ^~~~~
./include/trace/events/writeback.h:76:2: note: in expansion of macro
'TP_printk'
  TP_printk("bdi %s: ino=%lu index=%lu",
  ^~~~~~~~~

Fix them by adding necessary casts where ino_t could be either "unsigned
int" or "unsigned long".
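
The kind of cast described above can be shown with a small user-space sketch. The helper name format_ino is hypothetical; the point is only that the argument is converted to the exact type the format specifier expects.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * ino_t may be "unsigned int" or "unsigned long" depending on the
 * architecture, so cast to the type that "%lu" expects.
 */
static int format_ino(char *buf, size_t len, ino_t ino, unsigned long index)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "ino=%lu index=%lu",
                    (unsigned long)ino, index);
}
```

The cast is lossless when ino_t is no wider than unsigned long, which is the case the warning above describes. On a platform where ino_t is wider, a wider format and cast would be needed instead.
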

Fixes: f05499a0 ("writeback: use ino_t for inodes in tracepoints")
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Tejun Heo <tj@kernel.org>

23 Nov, 2019 1 commit

SUNRPC: Capture completion of all RPC tasks · a264abad

Authored by Chuck Lever on Nov 20, 2019

RPC tasks on the backchannel never invoke xprt_complete_rqst(), so
there is no way to report their tk_status at completion. Also, any
RPC task that exits via rpc_exit_task() before it is replied to will
also disappear without a trace.

Introduce a trace point that is symmetrical with rpc_task_begin that
captures the termination status of each RPC task.

Odd, though, that I never see trace_rpc_task_complete, either in the
forward or backchannel. Should it be removed?
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

21 Nov, 2019 1 commit

page_pool: Add API to update numa node · bc836748

Authored by Saeed Mahameed on Nov 20, 2019

Add page_pool_update_nid() to be called by page pool consumers when they
detect numa node changes.

It will update the page pool nid value to start allocating from the new
effective numa node.

This is to mitigate page pool allocating pages from a wrong numa node,
where the pool was originally allocated, and holding on to pages that
belong to a different numa node, which causes performance degradation.

For pages that are already being consumed and could be returned to the
pool by the consumer, in next patch we will add a check per page to avoid
recycling them back to the pool and return them to the page allocator.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Nov, 2019 5 commits

page_pool: extend tracepoint to also include the page PFN · 832ccf6f

Authored by Jesper Dangaard Brouer on Nov 16, 2019

The MM tracepoint for page free (called kmem:mm_page_free) doesn't provide
the page pointer directly, instead it provides the PFN (Page Frame Number).
This is annoying when writing a page_pool leak detector in BPF.

This patch changes page_pool tracepoints to also provide the PFN.
The page pointer is still provided to allow other kinds of
troubleshooting from BPF.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

page_pool: add destroy attempts counter and rename tracepoint · 7c9e6942

Authored by Jesper Dangaard Brouer on Nov 16, 2019

When Jonathan changed the page_pool to become responsible for its
own shutdown via a deferred work queue, the disconnect_cnt
counter was removed from the xdp memory model tracepoint.

This patch changes the page_pool_inflight tracepoint name to
page_pool_release, because it reflects the new responsibility
better.  And it reintroduces a counter that reflects the number of
times page_pool_release has been tried.

The counter is also used by the code, to only empty the alloc
cache once.  With a stuck work queue running every second and
counter being 64-bit, it will overrun in approx 584 billion
years. For comparison, Earth lifetime expectancy is 7.5 billion
years, before the Sun will engulf, and destroy, the Earth.
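
The 584 billion year figure can be checked with a short calculation. This sketch assumes one increment per second and a year of 365.25 days; the function name years_until_wrap is hypothetical.

```c
#include <assert.h>

/*
 * Years until an unsigned counter of the given bit width wraps when it
 * is incremented once per second, using a year of 365.25 days.
 */
static double years_until_wrap(unsigned int bits)
{
    const double seconds_per_year = 365.25 * 24.0 * 3600.0;
    double ticks = 1.0;

    assert(bits > 0 && bits <= 64);
    for (unsigned int i = 0; i < bits; i++)
        ticks *= 2.0; /* ticks ends up as 2^bits */
    return ticks / seconds_per_year;
}
```

For 64 bits this gives roughly 5.85e11 years, in line with the estimate above. A 32-bit counter under the same assumptions would wrap after about 136 years.
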
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

btrfs: rename btrfs_block_group_cache · 32da5386

Authored by David Sterba on Oct 29, 2019

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add dedicated members for start and length of a block group · b3470b5d

Authored by David Sterba on Oct 23, 2019

The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes things harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.

Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key.  When the item is searched or written, a
local variable for key is used.
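
A minimal sketch of the idea, assuming simplified stand-in types (struct key_stub, struct block_group_stub and the helpers are hypothetical, and the packed attribute is a GCC/Clang extension): the in-memory object keeps naturally aligned start and length members, and a key is only built in a local variable when an item is searched or written.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of a packed on-disk key. */
struct key_stub {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
} __attribute__((packed));

/* In-memory object with dedicated, naturally aligned members. */
struct block_group_stub {
    uint64_t start;  /* was key.objectid */
    uint64_t length; /* was key.offset */
};

/* Byte position of 'offset' in the packed key: 9, so not u64-aligned. */
static size_t key_offset_position(void)
{
    return offsetof(struct key_stub, offset);
}

/* Populate the in-memory members from a key read from the tree. */
static struct block_group_stub bg_from_key(const struct key_stub *key)
{
    struct block_group_stub bg;

    bg.start = key->objectid;
    bg.length = key->offset;
    return bg;
}

/* Build a local key only when the item is searched or written. */
static struct key_stub bg_to_key(const struct block_group_stub *bg,
                                 uint8_t type)
{
    struct key_stub key;

    key.objectid = bg->start;
    key.type = type;
    key.offset = bg->length;
    return key;
}
```
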
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move block_group_item::used to block group · bf38be65

Authored by David Sterba on Oct 23, 2019

For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.

The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

18 Nov, 2019 3 commits

btrfs: tracepoints: constify all pointers · 1d2e7c7c

Authored by David Sterba on Oct 17, 2019

We don't modify the data passed to tracepoints, some of the declarations
are already const, add it to the rest.
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: drop typecasts from printk · 94c3f6c6

Authored by David Sterba on Oct 17, 2019

Remove typecasts from trace printk, adjust types and move typecast to
the assignment if necessary. When assigning, the types are more obvious
compared to matching the variables to the format strings.
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: get rid of pointless wtag variable in async-thread.c · c9eb55db

Authored by Omar Sandoval on Sep 16, 2019

Commit ac0c7cf8 ("btrfs: fix crash when tracepoint arguments are
freed by wq callbacks") added a void pointer, wtag, which is passed into
trace_btrfs_all_work_done() instead of the freed work item. This is
silly for a few reasons:

1. The freed work item still has the same address.
2. work is still in scope after it's freed, so assigning wtag doesn't
   stop anyone from using it.
3. The tracepoint has always taken a void * argument, so assigning wtag
   doesn't actually make things any more type-safe. (Note that the
   original bug in commit bc074524 ("btrfs: prefix fsid to all trace
   events") was that the void * was implicitly cast when it was passed
   to btrfs_work_owner() in the trace point itself).

Instead, let's add some clearer warnings as comments.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

17 Nov, 2019 1 commit

page_pool: do not release pool until inflight == 0. · c3f812ce

Authored by Jonathan Lemon on Nov 14, 2019

The page pool keeps track of the number of pages in flight, and
it isn't safe to remove the pool until all pages are returned.

Disallow removing the pool until all pages are back, so the pool
is always available for page producers.

Make the page pool responsible for its own delayed destruction
instead of relying on XDP, so the page pool can be used without
the xdp memory model.

When all pages are returned, free the pool and notify xdp if the
pool is registered with the xdp memory system.  Have the callback
perform a table walk since some drivers (cpsw) may share the pool
among multiple xdp_rxq_info.

Note that the increment of pages_state_release_cnt may result in
inflight == 0, resulting in the pool being released.
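
The release rule can be sketched in user space as below. This is an illustration only: struct pool_stub and its helpers are hypothetical, and the two free-running counters are an assumption modelled on the pages_state_release_cnt counter named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool state: two free-running 32-bit counters. */
struct pool_stub {
    uint32_t hold_cnt;    /* pages handed out */
    uint32_t release_cnt; /* pages returned */
    bool destroy_requested;
    bool freed;
};

/* Pages still outstanding; unsigned subtraction survives wraparound. */
static int32_t pool_inflight(const struct pool_stub *p)
{
    uint32_t diff = p->hold_cnt - p->release_cnt;

    return (int32_t)diff;
}

/* Free the pool only once nothing is in flight; returns true if freed. */
static bool pool_try_release(struct pool_stub *p)
{
    assert(p != NULL);
    if (p->destroy_requested && pool_inflight(p) == 0)
        p->freed = true;
    return p->freed;
}

/* Request destruction; the pool lingers while pages are outstanding. */
static void pool_destroy(struct pool_stub *p)
{
    p->destroy_requested = true;
    pool_try_release(p);
}

/* Returning a page bumps release_cnt and may be what frees the pool. */
static void pool_return_page(struct pool_stub *p)
{
    p->release_cnt++;
    pool_try_release(p);
}
```

The last helper mirrors the note above: the increment of the release counter can itself bring inflight to zero and trigger the release.
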

Fixes: d956a048 ("xdp: force mem allocator removal and periodic warning")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

15 Nov, 2019 1 commit

y2038: itimer: change implementation to timespec64 · bd40a175

Authored by Arnd Bergmann on Nov 07, 2019

There is no 64-bit version of getitimer/setitimer since that is not
actually needed. However, the implementation is built around the
deprecated 'struct timeval' type.

Change the code to use timespec64 internally to reduce the dependencies
on timeval and associated helper functions.

Minor adjustments in the code are needed to make the native and compat
version work the same way, and to keep the range check working after
the conversion.
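
A rough user-space sketch of converting at the boundary while keeping a range check. The struct and function names are hypothetical stand-ins, and the exact validity rules used here (non-negative seconds, microseconds below one million) are an assumption, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define USEC_PER_SEC  1000000L
#define NSEC_PER_USEC 1000L

/* Hypothetical mirrors of the two representations. */
struct tv_stub {
    int64_t tv_sec;
    long tv_usec; /* legacy microsecond field */
};

struct ts64_stub {
    int64_t tv_sec;
    long tv_nsec; /* internal nanosecond field */
};

/* Convert on entry, rejecting out-of-range values before they spread. */
static bool tv_to_ts64(const struct tv_stub *in, struct ts64_stub *out)
{
    assert(in != NULL && out != NULL);
    if (in->tv_sec < 0 || in->tv_usec < 0 || in->tv_usec >= USEC_PER_SEC)
        return false;
    out->tv_sec = in->tv_sec;
    out->tv_nsec = in->tv_usec * NSEC_PER_USEC;
    return true;
}

/* Convert back on exit; sub-microsecond precision is truncated. */
static void ts64_to_tv(const struct ts64_stub *in, struct tv_stub *out)
{
    assert(in != NULL && out != NULL);
    out->tv_sec = in->tv_sec;
    out->tv_usec = in->tv_nsec / NSEC_PER_USEC;
}
```
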
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

13 Nov, 2019 3 commits

cgroup: use cgrp->kn->id as the cgroup ID · 74321038

Authored by Tejun Heo on Nov 04, 2019

cgroup ID is currently allocated using a dedicated per-hierarchy idr
and used internally and exposed through tracepoints and bpf.  This is
confusing because there are tracepoints and other interfaces which use
the cgroupfs ino as IDs.

The preceding changes exposed kn->id as the ino: a 64bit ino on
supported archs, or ino+gen (low 32bits ino, high 32bits gen)
otherwise.  There's no
reason for cgroup to use different IDs.  The kernfs IDs are unique and
userland can easily discover them and map them back to paths using
standard file operations.

This patch replaces cgroup IDs with kernfs IDs.

* cgroup_id() is added and all cgroup ID users are converted to use it.

* kernfs_node creation is moved to earlier during cgroup init so that
  cgroup_id() is available during init.

* While at it, s/cgroup/cgrp/ in psi helpers for consistency.

* Fallback ID value is changed to 1 to be consistent with root cgroup
  ID.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>

kernfs: convert kernfs_node->id from union kernfs_node_id to u64 · 67c0496e

Authored by Tejun Heo on Nov 04, 2019

kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value.  I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to.  Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.

This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper.  Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen.  This makes ID handling less cumbersome and will
allow using 64bit inos on supported archs.

This patch doesn't make any functional changes.
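
The layout described above (ino in the lower 32 bits, gen in the upper 32 bits) can be sketched with plain integer helpers. The names id_pack, id_ino and id_gen are hypothetical and are not the kernel accessors.

```c
#include <stdint.h>

/* Combine a 32-bit ino and a 32-bit gen into one 64-bit ID. */
static uint64_t id_pack(uint32_t ino, uint32_t gen)
{
    return ((uint64_t)gen << 32) | ino;
}

/* The ino lives in the low 32 bits. */
static uint32_t id_ino(uint64_t id)
{
    return (uint32_t)id;
}

/* The gen lives in the high 32 bits. */
static uint32_t id_gen(uint64_t id)
{
    return (uint32_t)(id >> 32);
}
```

With a single integer in place of a union, comparing or hashing an ID is one 64-bit operation.
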
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>

writeback: use ino_t for inodes in tracepoints · f05499a0

Authored by Tejun Heo on Nov 04, 2019

Writeback TPs currently use mix of 32 and 64bits for inos.  This isn't
currently broken because only cgroup inos are using 32bits and they're
limited to 32bits.  cgroup inos will make use of 64bits.  Let's
uniformly use ino_t.

While at it, switch the default cgroup ino value used when cgroup is
disabled to 1 instead of -1U as root cgroup always uses ino 1.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Namhyung Kim <namhyung@kernel.org>

10 Nov, 2019 1 commit

tcp: remove redundant new line from tcp_event_sk_skb · dd3d792d

Authored by Tony Lu on Nov 09, 2019

This removes '\n' from trace event class tcp_event_sk_skb to avoid
redundant new blank line and make output compact.

Fixes: af4325ec ("tcp: expose sk_state in tcp_retransmit_skb tracepoint")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08 Nov, 2019 2 commits

fsi: aspeed: Add trace points · 913b7373

Authored by Joel Stanley on Nov 08, 2019

These trace points help with debugging the FSI master. They show the low
level reads, writes and error states of the master.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Link: https://lore.kernel.org/r/20191108051945.7109-11-joel@jms.id.au
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

trace: fsi: Print transfer size unsigned · ae774816

Authored by Andrew Jeffery on Nov 08, 2019

Due to other bugs I observed a spurious -1 transfer size.
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20191108051945.7109-5-joel@jms.id.au
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

06 Nov, 2019 2 commits

jbd2: Provide trace event for handle restarts · 0094f981

Authored by Jan Kara on Nov 05, 2019

Provide trace event for handle restarts to ease debugging.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-24-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: Reserve revoke credits for freed blocks · 83448bdf

Authored by Jan Kara on Nov 05, 2019

So far we have reserved only relatively high fixed amount of revoke
credits for each transaction. We over-reserved by large amount for most
cases but when freeing large directories or files with data journalling,
the fixed amount is not enough. In fact the worst case estimate is
inconveniently large (maximum extent size) for freeing of one extent.

We fix this by doing proper estimate of the amount of blocks that need
to be revoked when removing blocks from the inode due to truncate or
hole punching and otherwise reserve just a small amount of revoke
credits for each transaction to accommodate freeing of xattrs block or
so.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191105164437.32602-23-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

04 Nov, 2019 1 commit

io_uring: add completion trace event · 51c3ff62

Authored by Jens Axboe on Nov 03, 2019

We currently don't have a completion event trace, add one of those. And
to better be able to match up submissions and completions, add user_data
to the submission trace as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

02 Nov, 2019 2 commits

io_uring: remove io_uring_add_to_prev() trace event · 0069fc6b

Authored by Jens Axboe on Nov 01, 2019

This internal logic was killed with the conversion to io-wq, so we no
longer have a need for this particular trace. Kill it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

net: bridge: fdb: br_fdb_update can take flags directly · be0c5677

Authored by Nikolay Aleksandrov on Nov 01, 2019

If we modify br_fdb_update() to take flags directly we can get rid of
one test and one atomic bitop in the learning path.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31 Oct, 2019 1 commit

SUNRPC: Trace gssproxy upcall results · ff27e9f7

Authored by Chuck Lever on Oct 24, 2019

Record results of a GSS proxy ACCEPT_SEC_CONTEXT upcall and the
svc_authenticate() function to make field debugging of NFS server
Kerberos issues easier.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Bill Baker <bill.baker@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

30 Oct, 2019 5 commits

rcu: Update descriptions for rcu_future_grace_period tracepoint · 7cc0fffd

Authored by Paul E. McKenney on Aug 21, 2019

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Update descriptions for rcu_nocb_wake tracepoint · d01f8620

Authored by Paul E. McKenney on Aug 21, 2019

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

rcu: Remove obsolete descriptions for rcu_barrier tracepoint · 7eb54685

Authored by Paul E. McKenney on Aug 20, 2019

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

io_uring: replace workqueue usage with io-wq · 561fb04a

Authored by Jens Axboe on Oct 24, 2019

Drop various work-arounds we have for workqueues:

- We no longer need the async_list for tracking sequential IO.

- We don't have to maintain our own mm tracking/setting.

- We don't need a separate workqueue for buffered writes. This didn't
  even work that well to begin with, as it was suboptimal for multiple
  buffered writers on multiple files.

- We can properly cancel pending interruptible work. This fixes
  deadlocks, particularly with socket IO, where we cannot cancel them
  when the io_uring is closed. Hence the ring will wait forever for
  these requests to complete, which may never happen. This is different
  from disk IO where we know requests will complete in a finite amount
  of time.

- Due to being able to cancel interruptible work that is already
  running, we can implement file table support for work. We need that
  for supporting system calls that add to a process file table.

- It gets us one step closer to adding async support for any system
  call.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add set of tracing events · c826bd7a

Authored by Dmitrii Dolgov on Oct 15, 2019

To trace io_uring activity one can get information from workqueue and
io trace events, but it looks like some parts could be hard to identify via
this approach. Making what happens inside io_uring more transparent is
important to be able to reason about many aspects of it, hence introduce
the set of tracing events.

All such events could be roughly divided into two categories:

* those, that are helping to understand correctness (from both kernel
  and an application point of view). E.g. a ring creation, file
  registration, or waiting for available CQE. Proposed approach is to
  get a pointer to an original structure of interest (ring context, or
  request), and then find relevant events. io_uring_queue_async_work
  also exposes a pointer to work_struct, to be able to track down
  corresponding workqueue events.

* those, that provide performance related information. Mostly it's about
  events that change the flow of requests, e.g. whether an async work
  was queued, or delayed due to some dependencies. Another important
  case is how io_uring optimizations (e.g. registered files) are
  utilized.
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

24 Oct, 2019 7 commits

xprtrdma: Replace dprintk in xprt_rdma_set_port · a52c23b8

Authored by Chuck Lever on Oct 23, 2019

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Replace dprintk() in rpcrdma_update_connect_private() · f54c870d

Authored by Chuck Lever on Oct 23, 2019

Clean up: Use a single trace point to record each connection's
negotiated inline thresholds and the computed maximum byte size
of transport headers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Refine trace_xprtrdma_fixup · d4957f01

Authored by Chuck Lever on Oct 23, 2019

Slightly reduce overhead and display more useful information.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Report the computed connect delay · 7b020f17

Authored by Chuck Lever on Oct 23, 2019

For debugging, the op_connect trace point should report the computed
connect delay. We can then ensure that the delay is computed at the
proper times, for example.

As a further clean-up, remove a few low-value "heartbeat" trace
points in the connect path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Pull up sometimes · 614f3c96

Authored by Chuck Lever on Oct 17, 2019

On some platforms, DMA mapping part of a page is more costly than
copying bytes. Restore the pull-up code and use that when we
think it's going to be faster. The heuristic for now is to pull-up
when the size of the RPC message body fits in the buffer underlying
the head iovec.

Indeed, not involving the I/O MMU can help the RPC/RDMA transport
scale better for tiny I/Os across more RDMA devices. This is because
interaction with the I/O MMU is eliminated, as is handling a Send
completion, for each of these small I/Os. Without the explicit
unmapping, the NIC no longer needs to do a costly internal TLB shoot
down for buffers that are just a handful of bytes.
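
The pull-up heuristic can be sketched in user space as follows. This is a simplified model under stated assumptions: struct msg_stub is hypothetical and has only a head and a tail segment, and "fits in the buffer underlying the head" is interpreted as the combined length not exceeding the head buffer's capacity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical message: a head with spare room plus a separate tail. */
struct msg_stub {
    char *head;
    size_t head_len;    /* bytes currently used in the head buffer */
    size_t head_buflen; /* total capacity of the head buffer */
    const char *tail;
    size_t tail_len;
};

/* Heuristic: copy when the whole message fits in the head buffer. */
static bool should_pull_up(const struct msg_stub *m)
{
    return m->head_len + m->tail_len <= m->head_buflen;
}

/* Pull up: append the tail to the head so one contiguous region is sent. */
static bool pull_up(struct msg_stub *m)
{
    assert(m != NULL && m->head != NULL);
    if (!should_pull_up(m))
        return false; /* caller would map the segments separately */
    if (m->tail_len > 0)
        memcpy(m->head + m->head_len, m->tail, m->tail_len);
    m->head_len += m->tail_len;
    m->tail_len = 0;
    return true;
}
```

The trade-off is a small copy in exchange for mapping a single region, which is the case the message argues is cheaper for tiny I/Os.
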
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Move the rpcrdma_sendctx::sc_wr field · dc15c3d5

Authored by Chuck Lever on Oct 17, 2019

Clean up: This field is not needed in the Send completion handler,
so it can be moved to struct rpcrdma_req to reduce the size of
struct rpcrdma_sendctx, and to reduce the amount of memory that
is sloshed between the sending process and the Send completion
process.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add unique trace points for posting Local Invalidate WRs · 4b93dab3

Authored by Chuck Lever on Oct 09, 2019

When adding frwr_unmap_async way back when, I re-used the existing
trace_xprtrdma_post_send() trace point to record the return code
of ib_post_send.

Unfortunately there are some cases where re-using that trace point
causes a crash. Instead, construct a trace point specific to posting
Local Invalidate WRs that will always be safe to use in that context,
and will act as a trace log eye-catcher for Local Invalidation.

Fixes: 84756894 ("xprtrdma: Remove fr_state")
Fixes: d8099fed ("xprtrdma: Reduce context switching due ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Bill Baker <bill.baker@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
