1. 06 6月, 2018 1 次提交
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers 提交于
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  2. 29 5月, 2018 10 次提交
  3. 25 5月, 2018 1 次提交
  4. 16 5月, 2018 1 次提交
  5. 15 5月, 2018 1 次提交
  6. 14 5月, 2018 1 次提交
  7. 11 5月, 2018 2 次提交
  8. 02 5月, 2018 1 次提交
  9. 27 4月, 2018 1 次提交
  10. 19 4月, 2018 2 次提交
  11. 12 4月, 2018 1 次提交
    • S
      mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event · d51d1e64
      Steven Rostedt 提交于
      The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
      parameters! Seven of them are from the reclaim_stat structure.  This
      structure is currently local to mm/vmscan.c.  By moving it to the global
      vmstat.h header, we can also reference it from the vmscan tracepoints.
      In moving it, it brings down the overhead of passing so many arguments
      to the trace event.  In the future, we may limit the number of arguments
      that a trace event may pass (ideally just 6, but more realistically it
      may be 8).
      
      Before this patch, the code to call the trace event is this:
      
       0f 83 aa fe ff ff       jae    ffffffff811e6261 <shrink_inactive_list+0x1e1>
       48 8b 45 a0             mov    -0x60(%rbp),%rax
       45 8b 64 24 20          mov    0x20(%r12),%r12d
       44 8b 6d d4             mov    -0x2c(%rbp),%r13d
       8b 4d d0                mov    -0x30(%rbp),%ecx
       44 8b 75 cc             mov    -0x34(%rbp),%r14d
       44 8b 7d c8             mov    -0x38(%rbp),%r15d
       48 89 45 90             mov    %rax,-0x70(%rbp)
       8b 83 b8 fe ff ff       mov    -0x148(%rbx),%eax
       8b 55 c0                mov    -0x40(%rbp),%edx
       8b 7d c4                mov    -0x3c(%rbp),%edi
       8b 75 b8                mov    -0x48(%rbp),%esi
       89 45 80                mov    %eax,-0x80(%rbp)
       65 ff 05 e4 f7 e2 7e    incl   %gs:0x7ee2f7e4(%rip)        # 15bd0 <__preempt_count>
       48 8b 05 75 5b 13 01    mov    0x1135b75(%rip),%rax        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
       48 85 c0                test   %rax,%rax
       74 72                   je     ffffffff811e646a <shrink_inactive_list+0x3ea>
       48 89 c3                mov    %rax,%rbx
       4c 8b 10                mov    (%rax),%r10
       89 f8                   mov    %edi,%eax
       48 89 85 68 ff ff ff    mov    %rax,-0x98(%rbp)
       89 f0                   mov    %esi,%eax
       48 89 85 60 ff ff ff    mov    %rax,-0xa0(%rbp)
       89 c8                   mov    %ecx,%eax
       48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
       89 d0                   mov    %edx,%eax
       48 89 85 70 ff ff ff    mov    %rax,-0x90(%rbp)
       8b 45 8c                mov    -0x74(%rbp),%eax
       48 8b 7b 08             mov    0x8(%rbx),%rdi
       48 83 c3 18             add    $0x18,%rbx
       50                      push   %rax
       41 54                   push   %r12
       41 55                   push   %r13
       ff b5 78 ff ff ff       pushq  -0x88(%rbp)
       41 56                   push   %r14
       41 57                   push   %r15
       ff b5 70 ff ff ff       pushq  -0x90(%rbp)
       4c 8b 8d 68 ff ff ff    mov    -0x98(%rbp),%r9
       4c 8b 85 60 ff ff ff    mov    -0xa0(%rbp),%r8
       48 8b 4d 98             mov    -0x68(%rbp),%rcx
       48 8b 55 90             mov    -0x70(%rbp),%rdx
       8b 75 80                mov    -0x80(%rbp),%esi
       41 ff d2                callq  *%r10
      
      After the patch:
      
       0f 83 a8 fe ff ff       jae    ffffffff811e626d <shrink_inactive_list+0x1cd>
       8b 9b b8 fe ff ff       mov    -0x148(%rbx),%ebx
       45 8b 64 24 20          mov    0x20(%r12),%r12d
       4c 8b 6d a0             mov    -0x60(%rbp),%r13
       65 ff 05 f5 f7 e2 7e    incl   %gs:0x7ee2f7f5(%rip)        # 15bd0 <__preempt_count>
       4c 8b 35 86 5b 13 01    mov    0x1135b86(%rip),%r14        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
       4d 85 f6                test   %r14,%r14
       74 2a                   je     ffffffff811e6411 <shrink_inactive_list+0x371>
       49 8b 06                mov    (%r14),%rax
       8b 4d 8c                mov    -0x74(%rbp),%ecx
       49 8b 7e 08             mov    0x8(%r14),%rdi
       49 83 c6 18             add    $0x18,%r14
       4c 89 ea                mov    %r13,%rdx
       45 89 e1                mov    %r12d,%r9d
       4c 8d 45 b8             lea    -0x48(%rbp),%r8
       89 de                   mov    %ebx,%esi
       51                      push   %rcx
       48 8b 4d 98             mov    -0x68(%rbp),%rcx
       ff d0                   callq  *%rax
      
      Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
      Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.homeSigned-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reported-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d51d1e64
  12. 11 4月, 2018 3 次提交
  13. 10 4月, 2018 3 次提交
    • D
      afs: Trace protocol errors · 5f702c8e
      David Howells 提交于
      Trace protocol errors detected in afs.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      5f702c8e
    • D
      afs: Locally edit directory data for mkdir/create/unlink/... · 63a4681f
      David Howells 提交于
      Locally edit the contents of an AFS directory upon a successful inode
      operation that modifies that directory (such as mkdir, create and unlink)
      so that we can avoid the current practice of re-downloading the directory
      after each change.
      
      This is viable provided that the directory version number we get back from
      the modifying RPC op is exactly incremented by 1 from what we had
      previously.  The data in the directory contents is in a defined format that
      we have to parse locally to perform lookups and readdir, so modifying isn't
      a problem.
      
      If the edit fails, we just clear the VALID flag on the directory and it
      will be reloaded next time it is needed.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      63a4681f
    • D
      afs: Prospectively look up extra files when doing a single lookup · 5cf9dd55
      David Howells 提交于
      When afs_lookup() is called, prospectively look up the next 50 uncached
      fids also from that same directory and cache the results, rather than just
      looking up the one file requested.
      
      This allows us to use the FS.InlineBulkStatus RPC op to increase efficiency
      by fetching up to 50 file statuses at a time.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      5cf9dd55
  14. 06 4月, 2018 4 次提交
  15. 04 4月, 2018 8 次提交
    • D
      fscache: Add more tracepoints · 08c2e3d0
      David Howells 提交于
      Add more tracepoints to fscache, including:
      
       (*) fscache_page - Tracks netfs pages known to fscache.
      
       (*) fscache_check_page - Tracks the netfs querying whether a page is
           pending storage.
      
       (*) fscache_wake_cookie - Tracks cookies being woken up after a page
           completes/aborts storage in the cache.
      
       (*) fscache_op - Tracks operations being initialised.
      
       (*) fscache_wrote_page - Tracks return of the backend write_page op.
      
       (*) fscache_gang_lookup - Tracks lookup of pages to be stored in the write
           operation.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      08c2e3d0
    • D
      fscache: Add tracepoints · a18feb55
      David Howells 提交于
      Add some tracepoints to fscache:
      
       (*) fscache_cookie - Tracks a cookie's usage count.
      
       (*) fscache_netfs - Logs registration of a network filesystem, including
           the pointer to the cookie allocated.
      
       (*) fscache_acquire - Logs cookie acquisition.
      
       (*) fscache_relinquish - Logs cookie relinquishment.
      
       (*) fscache_enable - Logs enablement of a cookie.
      
       (*) fscache_disable - Logs disablement of a cookie.
      
       (*) fscache_osm - Tracks execution of states in the object state machine.
      
      and cachefiles:
      
       (*) cachefiles_ref - Tracks a cachefiles object's usage count.
      
       (*) cachefiles_lookup - Logs result of lookup_one_len().
      
       (*) cachefiles_mkdir - Logs result of vfs_mkdir().
      
       (*) cachefiles_create - Logs result of vfs_create().
      
       (*) cachefiles_unlink - Logs calls to vfs_unlink().
      
       (*) cachefiles_rename - Logs calls to vfs_rename().
      
       (*) cachefiles_mark_active - Logs an object becoming active.
      
       (*) cachefiles_wait_active - Logs a wait for an old object to be
           destroyed.
      
       (*) cachefiles_mark_inactive - Logs an object becoming inactive.
      
       (*) cachefiles_mark_buried - Logs the burial of an object.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      a18feb55
    • C
      svc: Report xprt dequeue latency · 55f5088c
      Chuck Lever 提交于
      Record the time between when a rqstp is enqueued on a transport
      and when it is dequeued. This includes how long the rqstp waits on
      the queue and how long it takes the kernel scheduler to wake a
      nfsd thread to service it.
      
      The svc_xprt_dequeue trace point is altered to include the number
      of microseconds between xprt_enqueue and xprt_dequeue.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      55f5088c
    • C
      sunrpc: Report per-RPC execution stats · aaba72cd
      Chuck Lever 提交于
      Introduce a mechanism to report the server-side execution latency of
      each RPC. The goal is to enable user space to filter the trace
      record for latency outliers, build histograms, etc.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      aaba72cd
    • C
      sunrpc: Re-purpose trace_svc_process · 0b9547bf
      Chuck Lever 提交于
      Currently, trace_svc_process has two call sites:
      
      1. Just after a call to svc_send. svc_send already invokes
         trace_svc_send with the same arguments just before returning
      
      2. Just before a call to svc_drop. svc_drop already invokes
         trace_svc_drop with the same arguments just after it is called
      
      Therefore trace_svc_process does not provide any additional
      information not already provided by these other trace points.
      
      However, it would be useful to record the incoming RPC procedure.
      So reuse trace_svc_process for this purpose.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      0b9547bf
    • C
      sunrpc: Save remote presentation address in svc_xprt for trace events · ece200dd
      Chuck Lever 提交于
      TP_printk defines a format string that is passed to user space for
      converting raw trace event records to something human-readable.
      
      My user space's printf (Oracle Linux 7), however, does not have a
      %pI format specifier. The result is that what is supposed to be an
      IP address in the output of "trace-cmd report" is just a string that
      says the field couldn't be displayed.
      
      To fix this, adopt the same approach as the client: maintain a pre-
      formated presentation address for occasions when %pI is not
      available.
      
      The location of the trace_svc_send trace point is adjusted so that
      rqst->rq_xprt is not NULL when the trace event is recorded.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      ece200dd
    • C
      sunrpc: Simplify trace_svc_recv · 41f306d0
      Chuck Lever 提交于
      There doesn't seem to be a lot of value in calling trace_svc_recv
      in the failing case.
      
      1. There are two very common cases: one is the transport is not
      ready, and the other is shutdown. Neither is terribly interesting.
      
      2. The trace record for the failing case contains nothing but
      the status code.
      
      Therefore the trace point call site in the error exit is removed.
      Since the trace point is now recording a length instead of a
      status, rename the status field and remove the case that records a
      zero XID.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      41f306d0
    • C
      sunrpc: Move trace_svc_xprt_dequeue() · caa3e106
      Chuck Lever 提交于
      Reduce the amount of noise generated by trace_svc_xprt_dequeue by
      moving it to the end of svc_get_next_xprt. This generates exactly
      one trace event when a ready xprt is found, rather than spurious
      events when there is no work to do. The empty events contain no
      information that can't be obtained simply by tracing function calls
      to svc_xprt_dequeue.
      
      A small additional benefit is simplification of the svc_xprt_event
      trace class, which no longer has to handle the case when the @xprt
      parameter is NULL.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      caa3e106