1. 21 2月, 2019 3 次提交
    • E
      ext4: adjust reserved cluster count when removing extents · b2116b09
      Eric Whitney 提交于
      commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.
      
      Modify ext4_ext_remove_space() and the code it calls to correct the
      reserved cluster count for pending reservations (delayed allocated
      clusters shared with allocated blocks) when a block range is removed
      from the extent tree.  Pending reservations may be found for the clusters
      at the ends of written or unwritten extents when a block range is removed.
      If a physical cluster at the end of an extent is freed, it's necessary
      to increment the reserved cluster count to maintain correct accounting
      if the corresponding logical cluster is shared with at least one
      delayed and unwritten extent as found in the extents status tree.
      
      Add a new function, ext4_rereserve_cluster(), to reapply a reservation
      on a delayed allocated cluster sharing blocks with a freed allocated
      cluster.  To avoid ENOSPC on reservation, a flag is applied to
      ext4_free_blocks() to briefly defer updating the freeclusters counter
      when an allocated cluster is freed.  This prevents another thread
      from allocating the freed block before the reservation can be reapplied.
      
      Redefine the partial cluster object as a struct to carry more state
      information and to clarify the code using it.
      
      Adjust the conditional code structure in ext4_ext_remove_space to
      reduce the indentation level in the main body of the code to improve
      readability.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      b2116b09
    • E
      ext4: fix reserved cluster accounting at delayed write time · 99490366
      Eric Whitney 提交于
      commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.
      
      The code in ext4_da_map_blocks sometimes reserves space for more
      delayed allocated clusters than it should, resulting in premature
      ENOSPC, exceeded quota, and inaccurate free space reporting.
      
      Fix this by checking for written and unwritten blocks shared in the
      same cluster with the newly delayed allocated block.  A cluster
      reservation should not be made for a cluster for which physical space
      has already been allocated.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      99490366
    • E
      ext4: generalize extents status tree search functions · 667f9459
      Eric Whitney 提交于
      commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream.
      
      Ext4 contains a few functions that are used to search for delayed
      extents or blocks in the extents status tree.  Rather than duplicate
      code to add new functions to search for extents with different status
      values, such as written or a combination of delayed and unwritten,
      generalize the existing code to search for caller-specified extents
      status values.  Also, move this code into extents_status.c where it
      is better associated with the data structures it operates upon, and
      where it can be more readily used to implement new extents status tree
      functions that might want a broader scope for i_es_lock.
      
      Three missing static specifiers in RFC version of patch reported and
      fixed by Fengguang Wu <fengguang.wu@intel.com>.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      667f9459
  2. 17 1月, 2019 1 次提交
    • V
      sunrpc: use-after-free in svc_process_common() · 44e7bab3
      Vasily Averin 提交于
      commit d4b09acf924b84bae77cad090a9d108e70b43643 upstream.
      
      if node have NFSv41+ mounts inside several net namespaces
      it can lead to use-after-free in svc_process_common()
      
      svc_process_common()
              /* Setup reply header */
              rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
      
      svc_process_common() can use incorrect rqstp->rq_xprt,
      its caller function bc_svc_process() takes it from serv->sv_bc_xprt.
      The problem is that serv is global structure but sv_bc_xprt
      is assigned per-netnamespace.
      
      According to Trond, the whole "let's set up rqstp->rq_xprt
      for the back channel" is nothing but a giant hack in order
      to work around the fact that svc_process_common() uses it
      to find the xpt_ops, and perform a couple of (meaningless
      for the back channel) tests of xpt_flags.
      
      All we really need in svc_process_common() is to be able to run
      rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr()
      
      Bruce J Fields points that this xpo_prep_reply_hdr() call
      is an awfully roundabout way just to do "svc_putnl(resv, 0);"
      in the tcp case.
      
      This patch does not initialiuze rqstp->rq_xprt in bc_svc_process(),
      now it calls svc_process_common() with rqstp->rq_xprt = NULL.
      
      To adjust reply header svc_process_common() just check
      rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for tcp case.
      
      To handle rqstp->rq_xprt = NULL case in functions called from
      svc_process_common() patch intruduces net namespace pointer
      svc_rqst->rq_bc_net and adjust SVC_NET() definition.
      Some other function was also adopted to properly handle described case.
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Cc: stable@vger.kernel.org
      Fixes: 23c20ecd ("NFS: callback up - users counting cleanup")
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      v2: added lost extern svc_tcp_prep_reply_hdr()
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      44e7bab3
  3. 10 1月, 2019 1 次提交
    • T
      ext4: force inode writes when nfsd calls commit_metadata() · bf2fd1f9
      Theodore Ts'o 提交于
      commit fde872682e175743e0c3ef939c89e3c6008a1529 upstream.
      
      Some time back, nfsd switched from calling vfs_fsync() to using a new
      commit_metadata() hook in export_operations().  If the file system did
      not provide a commit_metadata() hook, it fell back to using
      sync_inode_metadata().  Unfortunately doesn't work on all file
      systems.  In particular, it doesn't work on ext4 due to how the inode
      gets journalled --- the VFS writeback code will not always call
      ext4_write_inode().
      
      So we need to provide our own ext4_nfs_commit_metdata() method which
      calls ext4_write_inode() directly.
      
      Google-Bug-Id: 121195940
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf2fd1f9
  4. 08 12月, 2018 1 次提交
  5. 09 10月, 2018 1 次提交
  6. 02 10月, 2018 1 次提交
    • M
      mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Mel Gorman 提交于
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16. In local tests, this patch
      improved performance of STREAM relative to the baseline but it is somewhat
      machine-dependent. Most workloads show little or not performance difference
      implying that there is not a heavily reliance on the throttling mechanism
      and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      efaffc5e
  7. 28 9月, 2018 1 次提交
    • D
      rxrpc: Fix error distribution · f3344303
      David Howells 提交于
      Fix error distribution by immediately delivering the errors to all the
      affected calls rather than deferring them to a worker thread.  The problem
      with the latter is that retries and things can happen in the meantime when we
      want to stop that sooner.
      
      To this end:
      
       (1) Stop the error distributor from removing calls from the error_targets
           list so that peer->lock isn't needed to synchronise against other adds
           and removals.
      
       (2) Require the peer's error_targets list to be accessed with RCU, thereby
           avoiding the need to take peer->lock over distribution.
      
       (3) Don't attempt to affect a call's state if it is already marked complete.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      f3344303
  8. 18 8月, 2018 1 次提交
    • D
      bpf: fix redirect to map under tail calls · f6069b9a
      Daniel Borkmann 提交于
      Commits 109980b8 ("bpf: don't select potentially stale ri->map
      from buggy xdp progs") and 7c300131 ("bpf: fix ri->map_owner
      pointer on bpf_prog_realloc") tried to mitigate that buggy programs
      using bpf_redirect_map() helper call do not leave stale maps behind.
      Idea was to add a map_owner cookie into the per CPU struct redirect_info
      which was set to prog->aux by the prog making the helper call as a
      proof that the map is not stale since the prog is implicitly holding
      a reference to it. This owner cookie could later on get compared with
      the program calling into BPF whether they match and therefore the
      redirect could proceed with processing the map safely.
      
      In (obvious) hindsight, this approach breaks down when tail calls are
      involved since the original caller's prog->aux pointer does not have
      to match the one from one of the progs out of the tail call chain,
      and therefore the xdp buffer will be dropped instead of redirected.
      A way around that would be to fix the issue differently (which also
      allows to remove related work in fast path at the same time): once
      the life-time of a redirect map has come to its end we use it's map
      free callback where we need to wait on synchronize_rcu() for current
      outstanding xdp buffers and remove such a map pointer from the
      redirect info if found to be present. At that time no program is
      using this map anymore so we simply invalidate the map pointers to
      NULL iff they previously pointed to that instance while making sure
      that the redirect path only reads out the map once.
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Reported-by: NSebastiano Miano <sebastiano.miano@polito.it>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f6069b9a
  9. 14 8月, 2018 1 次提交
    • Z
      net: Change the layout of structure trace_event_raw_fib_table_lookup · 0192e7d4
      Zong Li 提交于
      There is an unalignment access about the structure
      'trace_event_raw_fib_table_lookup'.
      
      In include/trace/events/fib.h, there is a memory operation which casting
      the 'src' data member to a pointer, and then store a value to this
      pointer point to.
      
      p32 = (__be32 *) __entry->src;
      *p32 = flp->saddr;
      
      The offset of 'src' in structure trace_event_raw_fib_table_lookup is not
      four bytes alignment. On some architectures, they don't permit the
      unalignment access, it need to pay the price to handle this situation in
      exception handler.
      
      Adjust the layout of structure to avoid this case.
      
      Fixes: 9f323973 ("net/ipv4: Udate fib_table_lookup tracepoint")
      Signed-off-by: NZong Li <zong@andestech.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0192e7d4
  10. 07 8月, 2018 1 次提交
  11. 06 8月, 2018 2 次提交
  12. 01 8月, 2018 3 次提交
  13. 31 7月, 2018 1 次提交
    • J
      tracing: Centralize preemptirq tracepoints and unify their usage · c3bc8fd6
      Joel Fernandes (Google) 提交于
      This patch detaches the preemptirq tracepoints from the tracers and
      keeps it separate.
      
      Advantages:
      * Lockdep and irqsoff event can now run in parallel since they no longer
      have their own calls.
      
      * This unifies the usecase of adding hooks to an irqsoff and irqson
      event, and a preemptoff and preempton event.
        3 users of the events exist:
        - Lockdep
        - irqsoff and preemptoff tracers
        - irqs and preempt trace events
      
      The unification cleans up several ifdefs and makes the code in preempt
      tracer and irqsoff tracers simpler. It gets rid of all the horrific
      ifdeferry around PROVE_LOCKING and makes configuration of the different
      users of the tracepoints more easy and understandable. It also gets rid
      of the time_* function calls from the lockdep hooks used to call into
      the preemptirq tracer which is not needed anymore. The negative delta in
      lines of code in this patch is quite large too.
      
      In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
      as a single point for registering probes onto the tracepoints. With
      this,
      the web of config options for preempt/irq toggle tracepoints and its
      users becomes:
      
       PREEMPT_TRACER   PREEMPTIRQ_EVENTS  IRQSOFF_TRACER PROVE_LOCKING
             |                 |     \         |           |
             \    (selects)    /      \        \ (selects) /
            TRACE_PREEMPT_TOGGLE       ----> TRACE_IRQFLAGS
                            \                  /
                             \ (depends on)   /
                           PREEMPTIRQ_TRACEPOINTS
      
      Other than the performance tests mentioned in the previous patch, I also
      ran the locking API test suite. I verified that all tests cases are
      passing.
      
      I also injected issues by not registering lockdep probes onto the
      tracepoints and I see failures to confirm that the probes are indeed
      working.
      
      This series + lockdep probes not registered (just to inject errors):
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      With this series + lockdep probes registered, all locking tests pass:
      
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.orgAcked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NNamhyung Kim <namhyung@kernel.org>
      Signed-off-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      c3bc8fd6
  14. 26 7月, 2018 1 次提交
  15. 23 7月, 2018 1 次提交
  16. 13 7月, 2018 10 次提交
  17. 12 7月, 2018 2 次提交
    • B
      fsi: master-gpio: Add more tracepoints · 777fd524
      Benjamin Herrenschmidt 提交于
      This adds a few more tracepoints that have proven useful when
      debugging issues with the FSI bus.
      
      This also makes echo_delay() use clock_zeros() instead of
      open-code it in order to share the tracepoint.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: NJoel Stanley <joel@jms.id.au>
      777fd524
    • S
      cgroup/tracing: Move taking of spin lock out of trace event handlers · e4f8d81c
      Steven Rostedt (VMware) 提交于
      It is unwise to take spin locks from the handlers of trace events.
      Mainly, because they can introduce lockups, because it introduces locks
      in places that are normally not tested. Worse yet, because trace events
      are tucked away in the include/trace/events/ directory, locks that are
      taken there are forgotten about.
      
      As a general rule, I tell people never to take any locks in a trace
      event handler.
      
      Several cgroup trace event handlers call cgroup_path() which eventually
      takes the kernfs_rename_lock spinlock. This injects the spinlock in the
      code without people realizing it. It also can cause issues for the
      PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
      handlers are called with preemption disabled.
      
      By moving the calculation of the cgroup_path() out of the trace event
      handlers and into a macro (surrounded by a
      trace_cgroup_##type##_enabled()), then we could place the cgroup_path
      into a string, and pass that to the trace event. Not only does this
      remove the taking of the spinlock out of the trace event handler, but
      it also means that the cgroup_path() only needs to be called once (it
      is currently called twice, once to get the length to reserver the
      buffer for, and once again to get the path itself. Now it only needs to
      be done once.
      Reported-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e4f8d81c
  18. 04 7月, 2018 1 次提交
  19. 02 7月, 2018 1 次提交
  20. 20 6月, 2018 1 次提交
    • J
      clk: add duty cycle support · 9fba738a
      Jerome Brunet 提交于
      Add the possibility to apply and query the clock signal duty cycle ratio.
      
      This is useful when the duty cycle of the clock signal depends on some
      other parameters controlled by the clock framework.
      
      For example, the duty cycle of a divider may depends on the raw divider
      setting (ratio = N / div) , which is controlled by the CCF. In such case,
      going through the pwm framework to control the duty cycle ratio of this
      clock would be a burden.
      
      A clock provider is not required to implement the operation to set and get
      the duty cycle. If it does not implement .get_duty_cycle(), the ratio is
      assumed to be 50%.
      
      This change also adds a new flag, CLK_DUTY_CYCLE_PARENT. This flag should
      be used to indicate that a clock, such as gates and muxes, may inherit
      the duty cycle ratio of its parent clock. If a clock does not provide a
      get_duty_cycle() callback and has CLK_DUTY_CYCLE_PARENT, then the call
      will be directly forwarded to its parent clock, if any. For
      set_duty_cycle(), the clock should also have CLK_SET_RATE_PARENT for the
      call to be forwarded
      Signed-off-by: NJerome Brunet <jbrunet@baylibre.com>
      Signed-off-by: NMichael Turquette <mturquette@baylibre.com>
      Link: lkml.kernel.org/r/20180619144141.8506-1-jbrunet@baylibre.com
      9fba738a
  21. 12 6月, 2018 2 次提交
  22. 06 6月, 2018 1 次提交
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers 提交于
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  23. 05 6月, 2018 1 次提交
    • D
      rxrpc: Fix handling of call quietly cancelled out on server · 1a025028
      David Howells 提交于
      Sometimes an in-progress call will stop responding on the fileserver when
      the fileserver quietly cancels the call with an internally marked abort
      (RX_CALL_DEAD), without sending an ABORT to the client.
      
      This causes the client's call to eventually expire from lack of incoming
      packets directed its way, which currently leads to it being cancelled
      locally with ETIME.  Note that it's not currently clear as to why this
      happens as it's really hard to reproduce.
      
      The rotation policy implement by kAFS, however, doesn't differentiate
      between ETIME meaning we didn't get any response from the server and ETIME
      meaning the call got cancelled mid-flow.  The latter leads to an oops when
      fetching data as the rotation partially resets the afs_read descriptor,
      which can result in a cleared page pointer being dereferenced because that
      page has already been filled.
      
      Handle this by the following means:
      
       (1) Set a flag on a call when we receive a packet for it.
      
       (2) Store the highest packet serial number so far received for a call
           (bearing in mind this may wrap).
      
       (3) If, when the "not received anything recently" timeout expires on a
           call, we've received at least one packet for a call and the connection
           as a whole has received packets more recently than that call, then
           cancel the call locally with ECONNRESET rather than ETIME.
      
           This indicates that the call was definitely in progress on the server.
      
       (4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME,
           don't try the next server, but rather abort the call.
      
           This avoids the oops as we don't try to reuse the afs_read struct.
           Rather, as-yet ungotten pages will be reread at a later data.
      
      Also:
      
       (5) Add an rxrpc tracepoint to log detection of the call being reset.
      
      Without this, I occasionally see an oops like the following:
      
          general protection fault: 0000 [#1] SMP PTI
          ...
          RIP: 0010:_copy_to_iter+0x204/0x310
          RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206
          RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560
          RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000
          RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400
          R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958
          R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560
          FS:  00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0
          Call Trace:
           skb_copy_datagram_iter+0x14e/0x289
           rxrpc_recvmsg_data.isra.0+0x6f3/0xf68
           ? trace_buffer_unlock_commit_regs+0x4f/0x89
           rxrpc_kernel_recv_data+0x149/0x421
           afs_extract_data+0x1e0/0x798
           ? afs_wait_for_call_to_complete+0xc9/0x52e
           afs_deliver_fs_fetch_data+0x33a/0x5ab
           afs_deliver_to_call+0x1ee/0x5e0
           ? afs_wait_for_call_to_complete+0xc9/0x52e
           afs_wait_for_call_to_complete+0x12b/0x52e
           ? wake_up_q+0x54/0x54
           afs_make_call+0x287/0x462
           ? afs_fs_fetch_data+0x3e6/0x3ed
           ? rcu_read_lock_sched_held+0x5d/0x63
           afs_fs_fetch_data+0x3e6/0x3ed
           afs_fetch_data+0xbb/0x14a
           afs_readpages+0x317/0x40d
           __do_page_cache_readahead+0x203/0x2ba
           ? ondemand_readahead+0x3a7/0x3c1
           ondemand_readahead+0x3a7/0x3c1
           generic_file_buffered_read+0x18b/0x62f
           __vfs_read+0xdb/0xfe
           vfs_read+0xb2/0x137
           ksys_read+0x50/0x8c
           do_syscall_64+0x7d/0x1a0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Note the weird value in RDI which is a result of trying to kmap() a NULL
      page pointer.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a025028
  24. 02 6月, 2018 1 次提交