1. 08 Oct, 2019 (1 commit)
  2. 29 Aug, 2019 (1 commit)
    • rxrpc: Fix read-after-free in rxrpc_queue_local() · a05354cb
      David Howells authored
      commit 06d9532fa6b34f12a6d75711162d47c17c1add72 upstream.
      
      rxrpc_queue_local() attempts to queue the local endpoint it is given and
      then, if successful, prints a trace line.  The trace line includes the
      current usage count - but we're not allowed to look at the local endpoint
      at this point as we passed our ref on it to the workqueue.
      
      Fix this by reading the usage count before queuing the work item.
      
      Also fix the reading of local->debug_id for trace lines, which must be done
      with the same consideration as reading the usage count.
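
      A minimal sketch of the reordering (names follow the surrounding rxrpc code;
      treat this as an illustration of the idea rather than the exact patch):

      	static void rxrpc_queue_local(struct rxrpc_local *local)
      	{
      		const void *here = __builtin_return_address(0);
      		/* Snapshot debug_id and the usage count before the ref is
      		 * handed over; once queue_work() succeeds, 'local' may
      		 * already have been freed by the work item.
      		 */
      		unsigned int debug_id = local->debug_id;
      		int n = atomic_read(&local->usage);

      		if (queue_work(rxrpc_workqueue, &local->processor))
      			trace_rxrpc_local(debug_id, rxrpc_local_queued, n, here);
      		else
      			rxrpc_put_local(local);
      	}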
      
      Fixes: 09d2bf59 ("rxrpc: Add a tracepoint to track rxrpc_local refcounting")
      Reported-by: syzbot+78e71c5bab4f76a6a719@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 26 Jul, 2019 (1 commit)
    • rxrpc: Fix oops in tracepoint · 0f2f2ceb
      David Howells authored
      [ Upstream commit 99f0eae653b2db64917d0b58099eb51e300b311d ]
      
      If the rxrpc_eproto tracepoint is enabled, an oops will be caused by the
      trace line that rxrpc_extract_header() tries to emit when a protocol error
      occurs (typically because the packet is short) because the call argument is
      NULL.
      
      Fix this by using ?: to assume 0 as the debug_id if call is NULL.
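
      The change boils down to one conditional in the tracepoint's assignment;
      a sketch of the pattern (field names assumed, not quoted from the patch):

      	TP_fast_assign(
      		/* Fall back to a debug_id of 0 when there is no call yet,
      		 * rather than dereferencing a NULL pointer.
      		 */
      		__entry->call	= call ? call->debug_id : 0;
      		__entry->why	= why;
      	),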
      
      This can then be induced by:
      
      	echo -e '\0\0\0\0\0\0\0\0' | ncat -4u --send-only <addr> 20001
      
      where addr has the following program running on it:
      
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <unistd.h>
      	#include <sys/socket.h>
      	#include <arpa/inet.h>
      	#include <linux/rxrpc.h>
      	int main(void)
      	{
      		struct sockaddr_rxrpc srx;
      		int fd;
      		memset(&srx, 0, sizeof(srx));
      		srx.srx_family			= AF_RXRPC;
      		srx.srx_service			= 0;
      		srx.transport_type		= AF_INET;
      		srx.transport_len		= sizeof(srx.transport.sin);
      		srx.transport.sin.sin_family	= AF_INET;
      		srx.transport.sin.sin_port	= htons(0x4e21);
      		fd = socket(AF_RXRPC, SOCK_DGRAM, AF_INET6);
      		bind(fd, (struct sockaddr *)&srx, sizeof(srx));
      		sleep(20);
      		return 0;
      	}
      
      It results in the following oops.
      
      	BUG: kernel NULL pointer dereference, address: 0000000000000340
      	#PF: supervisor read access in kernel mode
      	#PF: error_code(0x0000) - not-present page
      	...
      	RIP: 0010:trace_event_raw_event_rxrpc_rx_eproto+0x47/0xac
      	...
      	Call Trace:
      	 <IRQ>
      	 rxrpc_extract_header+0x86/0x171
      	 ? rcu_read_lock_sched_held+0x5d/0x63
      	 ? rxrpc_new_skb+0xd4/0x109
      	 rxrpc_input_packet+0xef/0x14fc
      	 ? rxrpc_input_data+0x986/0x986
      	 udp_queue_rcv_one_skb+0xbf/0x3d0
      	 udp_unicast_rcv_skb.isra.8+0x64/0x71
      	 ip_protocol_deliver_rcu+0xe4/0x1b4
      	 ip_local_deliver+0xf0/0x154
      	 __netif_receive_skb_one_core+0x50/0x6c
      	 netif_receive_skb_internal+0x26b/0x2e9
      	 napi_gro_receive+0xf8/0x1da
      	 rtl8169_poll+0x303/0x4c4
      	 net_rx_action+0x10e/0x333
      	 __do_softirq+0x1a5/0x38f
      	 irq_exit+0x54/0xc4
      	 do_IRQ+0xda/0xf8
      	 common_interrupt+0xf/0xf
      	 </IRQ>
      	 ...
      	 ? cpuidle_enter_state+0x23c/0x34d
      	 cpuidle_enter+0x2a/0x36
      	 do_idle+0x163/0x1ea
      	 cpu_startup_entry+0x1d/0x1f
      	 start_secondary+0x157/0x172
      	 secondary_startup_64+0xa4/0xb0
      
      Fixes: a25e21f0 ("rxrpc, afs: Use debug_ids rather than pointers in traces")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 20 Apr, 2019 (1 commit)
    • rxrpc: Fix client call connect/disconnect race · 11582064
      David Howells authored
      [ Upstream commit 930c9f9125c85b5134b3e711bc252ecc094708e3 ]
      
      rxrpc_disconnect_client_call() reads the call's connection ID protocol
      value (call->cid) as part of that function's variable declarations.  This
      is bad because it's not inside the locked section and so may race with
      someone granting use of the channel to the call.
      
      This manifests as an assertion failure (see below) where the call in the
      presumed channel (0 because call->cid wasn't set when we read it) doesn't
      match the call attached to the channel we were actually granted (if 1, 2 or
      3).
      
      Fix this by moving the read and dependent calculations inside the
      channel_lock section.  Also, only set the channel number and pointer
      variables if cid is not zero (i.e. unset).
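
      A simplified sketch of the reordering in rxrpc_disconnect_client_call()
      (variable names are assumptions; the real function does considerably more):

      	spin_lock(&conn->channel_lock);

      	/* Only read call->cid once channel_lock is held; a concurrent
      	 * channel grant may still be writing it.
      	 */
      	cid = call->cid;
      	if (cid) {
      		channel = cid & RXRPC_CHANNELMASK;
      		chan = &conn->channels[channel];
      	}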
      
      This problem can be induced by injecting an occasional error in
      rxrpc_wait_for_channel() before the call to schedule().
      
      Make two further changes also:
      
       (1) Add a trace for wait failure in rxrpc_connect_call().
      
       (2) Drop channel_lock before BUG'ing in the case of the assertion failure.
      
      The failure causes a trace akin to the following:
      
      rxrpc: Assertion failed - 18446612685268945920(0xffff8880beab8c00) == 18446612685268621312(0xffff8880bea69800) is false
      ------------[ cut here ]------------
      kernel BUG at net/rxrpc/conn_client.c:824!
      ...
      RIP: 0010:rxrpc_disconnect_client_call+0x2bf/0x99d
      ...
      Call Trace:
       rxrpc_connect_call+0x902/0x9b3
       ? wake_up_q+0x54/0x54
       rxrpc_new_client_call+0x3a0/0x751
       ? rxrpc_kernel_begin_call+0x141/0x1bc
       ? afs_alloc_call+0x1b5/0x1b5
       rxrpc_kernel_begin_call+0x141/0x1bc
       afs_make_call+0x20c/0x525
       ? afs_alloc_call+0x1b5/0x1b5
       ? __lock_is_held+0x40/0x71
       ? lockdep_init_map+0xaf/0x193
       ? lockdep_init_map+0xaf/0x193
       ? __lock_is_held+0x40/0x71
       ? yfs_fs_fetch_data+0x33b/0x34a
       yfs_fs_fetch_data+0x33b/0x34a
       afs_fetch_data+0xdc/0x3b7
       afs_read_dir+0x52d/0x97f
       afs_dir_iterate+0xa0/0x661
       ? iterate_dir+0x63/0x141
       iterate_dir+0xa2/0x141
       ksys_getdents64+0x9f/0x11b
       ? filldir+0x111/0x111
       ? do_syscall_64+0x3e/0x1a0
       __x64_sys_getdents64+0x16/0x19
       do_syscall_64+0x7d/0x1a0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: 45025bce ("rxrpc: Improve management and caching of client connection objects")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  5. 17 Jan, 2019 (1 commit)
    • sunrpc: use-after-free in svc_process_common() · 44e7bab3
      Vasily Averin authored
      commit d4b09acf924b84bae77cad090a9d108e70b43643 upstream.
      
      If a node has NFSv41+ mounts inside several net namespaces,
      this can lead to a use-after-free in svc_process_common():
      
      svc_process_common()
              /* Setup reply header */
              rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
      
      svc_process_common() can use an incorrect rqstp->rq_xprt;
      its caller, bc_svc_process(), takes it from serv->sv_bc_xprt.
      The problem is that serv is a global structure but sv_bc_xprt
      is assigned per net namespace.
      
      According to Trond, the whole "let's set up rqstp->rq_xprt
      for the back channel" is nothing but a giant hack in order
      to work around the fact that svc_process_common() uses it
      to find the xpt_ops, and perform a couple of (meaningless
      for the back channel) tests of xpt_flags.
      
      All we really need in svc_process_common() is to be able to run
      rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr()
      
      Bruce J Fields points out that this xpo_prep_reply_hdr() call
      is an awfully roundabout way just to do "svc_putnl(resv, 0);"
      in the tcp case.
      
      This patch does not initialize rqstp->rq_xprt in bc_svc_process();
      it now calls svc_process_common() with rqstp->rq_xprt = NULL.
      
      To adjust the reply header, svc_process_common() now just checks
      rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for the tcp case.
      
      To handle the rqstp->rq_xprt = NULL case in functions called from
      svc_process_common(), the patch introduces a net namespace pointer,
      svc_rqst->rq_bc_net, and adjusts the SVC_NET() definition.
      Some other functions were also adapted to properly handle the described case.
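
      A sketch of the two pieces described above (field and helper names are
      taken from the description; this is not the patch verbatim):

      	/* Reply header setup without touching rq_xprt. */
      	if (rqstp->rq_prot == IPPROTO_TCP)
      		svc_tcp_prep_reply_hdr(rqstp);

      	/* SVC_NET() falls back to the back-channel net pointer when
      	 * rq_xprt is NULL.
      	 */
      	#define SVC_NET(rqst) \
      		(rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
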
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Cc: stable@vger.kernel.org
      Fixes: 23c20ecd ("NFS: callback up - users counting cleanup")
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      v2: added lost extern svc_tcp_prep_reply_hdr()
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 10 Jan, 2019 (1 commit)
    • ext4: force inode writes when nfsd calls commit_metadata() · bf2fd1f9
      Theodore Ts'o authored
      commit fde872682e175743e0c3ef939c89e3c6008a1529 upstream.
      
      Some time back, nfsd switched from calling vfs_fsync() to using a new
      commit_metadata() hook in export_operations().  If the file system did
      not provide a commit_metadata() hook, it fell back to using
      sync_inode_metadata().  Unfortunately this doesn't work on all file
      systems.  In particular, it doesn't work on ext4 due to how the inode
      gets journalled --- the VFS writeback code will not always call
      ext4_write_inode().
      
      So we need to provide our own ext4_nfs_commit_metadata() method which
      calls ext4_write_inode() directly.
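
      A sketch of such a hook, based on the description above (the exact body
      in the patch may differ):

      	static int ext4_nfs_commit_metadata(struct inode *inode)
      	{
      		struct writeback_control wbc = {
      			.sync_mode = WB_SYNC_ALL
      		};

      		/* Push the inode through the journal directly instead of
      		 * relying on sync_inode_metadata(), which the VFS writeback
      		 * code may skip for ext4.
      		 */
      		return ext4_write_inode(inode, &wbc);
      	}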
      
      Google-Bug-Id: 121195940
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 08 Dec, 2018 (1 commit)
  8. 09 Oct, 2018 (1 commit)
  9. 02 Oct, 2018 (1 commit)
    • mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Mel Gorman authored
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16. In local tests, this patch
      improved performance of STREAM relative to the baseline but it is somewhat
      machine-dependent. Most workloads show little or no performance difference,
      implying that there is not a heavy reliance on the throttling mechanism
      and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 28 Sep, 2018 (1 commit)
    • rxrpc: Fix error distribution · f3344303
      David Howells authored
      Fix error distribution by immediately delivering the errors to all the
      affected calls rather than deferring them to a worker thread.  The problem
      with the latter is that retries and the like can happen in the meantime, when
      we really want to stop them sooner.
      
      To this end:
      
       (1) Stop the error distributor from removing calls from the error_targets
           list so that peer->lock isn't needed to synchronise against other adds
           and removals.
      
       (2) Require the peer's error_targets list to be accessed with RCU, thereby
           avoiding the need to take peer->lock over distribution.
      
       (3) Don't attempt to affect a call's state if it is already marked complete.
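
      With (1) and (2) in place, distribution becomes an RCU walk of the list; a
      rough sketch (helper names assumed):

      	rcu_read_lock();
      	hlist_for_each_entry_rcu(call, &peer->error_targets, error_link) {
      		/* Per (3): leave calls alone once they are already complete. */
      		if (call->state < RXRPC_CALL_COMPLETE)
      			rxrpc_set_call_completion(call, RXRPC_CALL_NETWORK_ERROR,
      						  0, error);
      	}
      	rcu_read_unlock();
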
      Signed-off-by: David Howells <dhowells@redhat.com>
  11. 18 Aug, 2018 (1 commit)
    • bpf: fix redirect to map under tail calls · f6069b9a
      Daniel Borkmann authored
      Commits 109980b8 ("bpf: don't select potentially stale ri->map
      from buggy xdp progs") and 7c300131 ("bpf: fix ri->map_owner
      pointer on bpf_prog_realloc") tried to ensure that buggy programs
      using the bpf_redirect_map() helper call do not leave stale maps behind.
      The idea was to add a map_owner cookie into the per-CPU struct redirect_info
      which was set to prog->aux by the prog making the helper call as a
      proof that the map is not stale since the prog is implicitly holding
      a reference to it. This owner cookie could later on get compared with
      the program calling into BPF to check whether they match, and therefore
      the redirect could proceed with processing the map safely.
      
      In (obvious) hindsight, this approach breaks down when tail calls are
      involved since the original caller's prog->aux pointer does not have
      to match the one from one of the progs out of the tail call chain,
      and therefore the xdp buffer will be dropped instead of redirected.
      A way around that would be to fix the issue differently (which also
      allows removing related work in the fast path at the same time): once
      the lifetime of a redirect map has come to its end we use its map
      free callback where we need to wait on synchronize_rcu() for current
      outstanding xdp buffers and remove such a map pointer from the
      redirect info if found to be present. At that time no program is
      using this map anymore so we simply invalidate the map pointers to
      NULL iff they previously pointed to that instance while making sure
      that the redirect path only reads out the map once.
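
      A sketch of the free-path invalidation this describes (the per-CPU
      structure and helper names are assumptions based on the text):

      	static void bpf_clear_redirect_map(struct bpf_map *map)
      	{
      		struct redirect_info *ri;
      		int cpu;

      		for_each_possible_cpu(cpu) {
      			ri = per_cpu_ptr(&redirect_info, cpu);
      			/* Clear the pointer only if this CPU still refers to
      			 * the map being freed.
      			 */
      			if (ri->map == map)
      				cmpxchg(&ri->map, map, NULL);
      		}
      	}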
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  12. 14 Aug, 2018 (1 commit)
    • net: Change the layout of structure trace_event_raw_fib_table_lookup · 0192e7d4
      Zong Li authored
      There is an unaligned access to the structure
      'trace_event_raw_fib_table_lookup'.
      
      In include/trace/events/fib.h, there is a memory operation which casts
      the 'src' data member to a pointer and then stores a value through that
      pointer:
      
      p32 = (__be32 *) __entry->src;
      *p32 = flp->saddr;
      
      The offset of 'src' in struct trace_event_raw_fib_table_lookup is not
      four-byte aligned. Some architectures don't permit unaligned accesses
      and have to pay the price of handling them in the exception handler.
      
      Adjust the layout of the structure to avoid this case.
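
      A sketch of the kind of reordering meant here (the real entry has more
      fields; the point is simply to place the four-byte 'src'/'dst' arrays at
      naturally aligned offsets, ahead of the __u8 members):

      	TP_STRUCT__entry(
      		__field(	u32,	tb_id	)
      		__field(	int,	oif	)
      		__field(	int,	iif	)
      		__array(	__u8,	src,	4	)	/* now 4-byte aligned */
      		__array(	__u8,	dst,	4	)
      		__field(	__u8,	tos	)
      		__field(	__u8,	scope	)
      		__field(	__u8,	flags	)
      	),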
      
      Fixes: 9f323973 ("net/ipv4: Udate fib_table_lookup tracepoint")
      Signed-off-by: Zong Li <zong@andestech.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 07 Aug, 2018 (1 commit)
  14. 06 Aug, 2018 (2 commits)
  15. 01 Aug, 2018 (3 commits)
  16. 31 Jul, 2018 (1 commit)
    • tracing: Centralize preemptirq tracepoints and unify their usage · c3bc8fd6
      Joel Fernandes (Google) authored
      This patch detaches the preemptirq tracepoints from the tracers and
      keeps them separate.
      
      Advantages:
      * Lockdep and irqsoff event can now run in parallel since they no longer
      have their own calls.
      
      * This unifies the use case of adding hooks to an irqsoff and irqson
      event, and a preemptoff and preempton event.
        3 users of the events exist:
        - Lockdep
        - irqsoff and preemptoff tracers
        - irqs and preempt trace events
      
      The unification cleans up several ifdefs and makes the code in preempt
      tracer and irqsoff tracers simpler. It gets rid of all the horrific
      ifdeferry around PROVE_LOCKING and makes configuration of the different
      users of the tracepoints easier and more understandable. It also gets rid
      of the time_* function calls from the lockdep hooks used to call into
      the preemptirq tracer which is not needed anymore. The negative delta in
      lines of code in this patch is quite large too.
      
      In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
      as a single point for registering probes onto the tracepoints. With
      this, the web of config options for preempt/irq toggle tracepoints and
      their users becomes:
      
       PREEMPT_TRACER   PREEMPTIRQ_EVENTS  IRQSOFF_TRACER PROVE_LOCKING
             |                 |     \         |           |
             \    (selects)    /      \        \ (selects) /
            TRACE_PREEMPT_TOGGLE       ----> TRACE_IRQFLAGS
                            \                  /
                             \ (depends on)   /
                           PREEMPTIRQ_TRACEPOINTS
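
      For illustration, hooking one of these tracepoints looks roughly like this
      (the probe itself is hypothetical; the register_trace_* helpers are
      generated per tracepoint):

      	/* Probe signature matches the irq_disable/irq_enable tracepoints. */
      	static void my_irq_disable_probe(void *data, unsigned long ip,
      					 unsigned long parent_ip)
      	{
      		/* react to interrupts being disabled at 'ip' */
      	}

      	static int __init my_probe_init(void)
      	{
      		return register_trace_irq_disable(my_irq_disable_probe, NULL);
      	}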
      
      Other than the performance tests mentioned in the previous patch, I also
      ran the locking API test suite. I verified that all test cases are
      passing.
      
      I also injected issues by not registering lockdep probes onto the
      tracepoints and I see failures to confirm that the probes are indeed
      working.
      
      This series + lockdep probes not registered (just to inject errors):
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      With this series + lockdep probes registered, all locking tests pass:
      
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  17. 26 Jul, 2018 (1 commit)
  18. 23 Jul, 2018 (1 commit)
  19. 13 Jul, 2018 (10 commits)
  20. 12 Jul, 2018 (2 commits)
    • fsi: master-gpio: Add more tracepoints · 777fd524
      Benjamin Herrenschmidt authored
      This adds a few more tracepoints that have proven useful when
      debugging issues with the FSI bus.
      
      This also makes echo_delay() use clock_zeros() instead of
      open-coding it, in order to share the tracepoint.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: Joel Stanley <joel@jms.id.au>
    • cgroup/tracing: Move taking of spin lock out of trace event handlers · e4f8d81c
      Steven Rostedt (VMware) authored
      It is unwise to take spin locks from the handlers of trace events.
      Mainly, because they can introduce lockups, since this introduces locks
      in places that are normally not tested. Worse yet, because trace events
      are tucked away in the include/trace/events/ directory, locks that are
      taken there are forgotten about.
      
      As a general rule, I tell people never to take any locks in a trace
      event handler.
      
      Several cgroup trace event handlers call cgroup_path() which eventually
      takes the kernfs_rename_lock spinlock. This injects the spinlock in the
      code without people realizing it. It also can cause issues for the
      PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
      handlers are called with preemption disabled.
      
      By moving the calculation of the cgroup_path() out of the trace event
      handlers and into a macro (surrounded by a
      trace_cgroup_##type##_enabled() check), we can place the cgroup_path
      into a string and pass that to the trace event. Not only does this
      remove the taking of the spinlock out of the trace event handler, but
      it also means that cgroup_path() only needs to be called once (it
      is currently called twice: once to get the length to reserve the
      buffer for, and once again to get the path itself). Now it only needs
      to be done once.
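
      A sketch of the wrapper-macro pattern being described (simplified; the
      buffer, lock and exact trace arguments here are assumptions, not the
      patch verbatim):

      	#define TRACE_CGROUP_PATH(type, cgrp, ...)				\
      		do {								\
      			if (trace_cgroup_##type##_enabled()) {			\
      				/* Path is resolved outside the handler... */	\
      				spin_lock(&trace_cgroup_path_lock);		\
      				cgroup_path(cgrp, trace_cgroup_path,		\
      					    TRACE_CGROUP_PATH_LEN);		\
      				/* ...and only the string is traced. */		\
      				trace_cgroup_##type(cgrp, trace_cgroup_path,	\
      						    ##__VA_ARGS__);		\
      				spin_unlock(&trace_cgroup_path_lock);		\
      			}							\
      		} while (0)
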
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  21. 04 Jul, 2018 (1 commit)
  22. 02 Jul, 2018 (1 commit)
  23. 20 Jun, 2018 (1 commit)
    • clk: add duty cycle support · 9fba738a
      Jerome Brunet authored
      Add the possibility to apply and query the clock signal duty cycle ratio.
      
      This is useful when the duty cycle of the clock signal depends on some
      other parameters controlled by the clock framework.
      
      For example, the duty cycle of a divider may depend on the raw divider
      setting (ratio = N / div), which is controlled by the CCF. In such a case,
      going through the pwm framework to control the duty cycle ratio of this
      clock would be a burden.
      
      A clock provider is not required to implement the operation to set and get
      the duty cycle. If it does not implement .get_duty_cycle(), the ratio is
      assumed to be 50%.
      
      This change also adds a new flag, CLK_DUTY_CYCLE_PARENT. This flag should
      be used to indicate that a clock, such as gates and muxes, may inherit
      the duty cycle ratio of its parent clock. If a clock does not provide a
      get_duty_cycle() callback and has CLK_DUTY_CYCLE_PARENT, then the call
      will be directly forwarded to its parent clock, if any. For
      set_duty_cycle(), the clock should also have CLK_SET_RATE_PARENT for the
      call to be forwarded.
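
      A consumer-side usage sketch (the clk_set_duty_cycle() and
      clk_get_scaled_duty_cycle() names follow this API; the device and clock
      name are hypothetical):

      	struct clk *clk = devm_clk_get(dev, "lrclk");	/* hypothetical clock */
      	int ret, pct;

      	ret = clk_set_duty_cycle(clk, 1, 4);		/* request 25% (num / den) */
      	pct = clk_get_scaled_duty_cycle(clk, 100);	/* read back, scaled to 100 */
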
      Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
      Signed-off-by: Michael Turquette <mturquette@baylibre.com>
      Link: lkml.kernel.org/r/20180619144141.8506-1-jbrunet@baylibre.com
  24. 12 Jun, 2018 (2 commits)
  25. 06 Jun, 2018 (1 commit)
    • rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers authored
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartable sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up to date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
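
      A minimal user-space sketch of that fast path (the RSEQ_SIG value and the
      bare syscall() registration are assumptions; real users typically go
      through a library such as librseq):

      	#include <linux/rseq.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>

      	#define RSEQ_SIG 0x53053053	/* assumed abort signature */

      	static __thread volatile struct rseq rseq_area
      			__attribute__((aligned(32)));

      	static int current_cpu(void)
      	{
      		static __thread int registered;

      		/* Register once; the kernel then keeps rseq_area.cpu_id
      		 * current on every return to user-space.
      		 */
      		if (!registered) {
      			syscall(__NR_rseq, &rseq_area, sizeof(rseq_area),
      				0, RSEQ_SIG);
      			registered = 1;
      		}
      		return rseq_area.cpu_id;	/* plain load, no system call */
      	}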
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
  26. 05 Jun, 2018 (1 commit)
    • rxrpc: Fix handling of call quietly cancelled out on server · 1a025028
      David Howells authored
      Sometimes an in-progress call will stop responding on the fileserver when
      the fileserver quietly cancels the call with an internally marked abort
      (RX_CALL_DEAD), without sending an ABORT to the client.
      
      This causes the client's call to eventually expire from lack of incoming
      packets directed its way, which currently leads to it being cancelled
      locally with ETIME.  Note that it's not currently clear as to why this
      happens as it's really hard to reproduce.
      
      The rotation policy implemented by kAFS, however, doesn't differentiate
      between ETIME meaning we didn't get any response from the server and ETIME
      meaning the call got cancelled mid-flow.  The latter leads to an oops when
      fetching data as the rotation partially resets the afs_read descriptor,
      which can result in a cleared page pointer being dereferenced because that
      page has already been filled.
      
      Handle this by the following means:
      
       (1) Set a flag on a call when we receive a packet for it.
      
       (2) Store the highest packet serial number so far received for a call
           (bearing in mind this may wrap).
      
       (3) If, when the "not received anything recently" timeout expires on a
           call, we've received at least one packet for a call and the connection
           as a whole has received packets more recently than that call, then
           cancel the call locally with ECONNRESET rather than ETIME.
      
           This indicates that the call was definitely in progress on the server.
      
       (4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME,
           don't try the next server, but rather abort the call.
      
           This avoids the oops as we don't try to reuse the afs_read struct.
      Rather, as-yet ungotten pages will be reread at a later date.
      
      Also:
      
       (5) Add an rxrpc tracepoint to log detection of the call being reset.
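
      A sketch of the expiry-time decision from (3) (the flag and serial-number
      field names are assumptions based on the description):

      	/* On "not received anything recently" expiry: */
      	if (test_bit(RXRPC_CALL_RX_HEARD, &call->flags) &&
      	    (int)call->conn->hi_serial - (int)call->rx_serial > 0) {
      		/* The connection is still seeing traffic, just not for this
      		 * call - assume the server quietly cancelled it.
      		 */
      		trace_rxrpc_call_reset(call);
      		rxrpc_abort_call("EXP", call, 0, RX_USER_ABORT, -ECONNRESET);
      	} else {
      		rxrpc_abort_call("EXP", call, 0, RX_USER_ABORT, -ETIME);
      	}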
      
      Without this, I occasionally see an oops like the following:
      
          general protection fault: 0000 [#1] SMP PTI
          ...
          RIP: 0010:_copy_to_iter+0x204/0x310
          RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206
          RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560
          RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000
          RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400
          R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958
          R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560
          FS:  00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0
          Call Trace:
           skb_copy_datagram_iter+0x14e/0x289
           rxrpc_recvmsg_data.isra.0+0x6f3/0xf68
           ? trace_buffer_unlock_commit_regs+0x4f/0x89
           rxrpc_kernel_recv_data+0x149/0x421
           afs_extract_data+0x1e0/0x798
           ? afs_wait_for_call_to_complete+0xc9/0x52e
           afs_deliver_fs_fetch_data+0x33a/0x5ab
           afs_deliver_to_call+0x1ee/0x5e0
           ? afs_wait_for_call_to_complete+0xc9/0x52e
           afs_wait_for_call_to_complete+0x12b/0x52e
           ? wake_up_q+0x54/0x54
           afs_make_call+0x287/0x462
           ? afs_fs_fetch_data+0x3e6/0x3ed
           ? rcu_read_lock_sched_held+0x5d/0x63
           afs_fs_fetch_data+0x3e6/0x3ed
           afs_fetch_data+0xbb/0x14a
           afs_readpages+0x317/0x40d
           __do_page_cache_readahead+0x203/0x2ba
           ? ondemand_readahead+0x3a7/0x3c1
           ondemand_readahead+0x3a7/0x3c1
           generic_file_buffered_read+0x18b/0x62f
           __vfs_read+0xdb/0xfe
           vfs_read+0xb2/0x137
           ksys_read+0x50/0x8c
           do_syscall_64+0x7d/0x1a0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Note the weird value in RDI which is a result of trying to kmap() a NULL
      page pointer.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>