1. 16 November 2017 (5 commits)
  2. 13 November 2017 (5 commits)
  3. 12 November 2017 (3 commits)
    • timers: Add a function to start/reduce a timer · b24591e2
      David Howells committed
      Add a function, similar to mod_timer(), that will start a timer if it isn't
      running and will modify it if it is running and has an expiry time longer
      than the new time.  If the timer is running with an expiry time that's the
      same or sooner, no change is made.
      
      The function looks like:
      
      	int timer_reduce(struct timer_list *timer, unsigned long expires);
      
      This can be used by code such as networking code to make it easier to share
      a timer for multiple timeouts.  For instance, in upcoming AF_RXRPC code,
      the rxrpc_call struct will maintain a number of timeouts:
      
      	unsigned long	ack_at;
      	unsigned long	resend_at;
      	unsigned long	ping_at;
      	unsigned long	expect_rx_by;
      	unsigned long	expect_req_by;
      	unsigned long	expect_term_by;
      
      each of which is set independently of the others.  With timer reduction
      available, when the code needs to set one of the timeouts, it only needs to
      look at that timeout and then call timer_reduce() to modify the timer,
      starting it or bringing it forward if necessary.  There is no need to refer
      to the other timeouts to see which is earliest and no need to take any lock
      other than, potentially, the timer lock inside timer_reduce().
      
      Note that this does not protect against concurrent invocations of any of
      the timer functions.
      
      As an example, the expect_rx_by timeout above, which terminates a call if
      we don't get a packet from the server within a certain time window, would
      be set something like this:
      
      	unsigned long now = jiffies;
      	unsigned long expect_rx_by = now + packet_receive_timeout;
      	WRITE_ONCE(call->expect_rx_by, expect_rx_by);
      	timer_reduce(&call->timer, expect_rx_by);
      
      The timer service code (which might, say, run from a work function) would
      then check all the timeouts to see which, if any, have expired, and deal
      with those:
      
      	t = READ_ONCE(call->ack_at);
      	if (time_after_eq(now, t)) {
      		cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET);
      		set_bit(RXRPC_CALL_EV_ACK, &call->events);
      	}
      
      and then, if necessary, restart the timer by finding the soonest timeout
      that hasn't yet passed and calling timer_reduce().
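      
      A rough sketch of that restart step, assuming a hypothetical array of the
      call's timeouts (call->timeouts[] and NR_CALL_TIMEOUTS are illustrative
      only; the real code checks each named field instead):
      
      	unsigned long soonest = now + MAX_JIFFY_OFFSET;
      	int i;
      
      	/* Find the earliest timeout that is still in the future. */
      	for (i = 0; i < NR_CALL_TIMEOUTS; i++) {
      		unsigned long t = READ_ONCE(call->timeouts[i]);
      
      		if (time_after(t, now) && time_before(t, soonest))
      			soonest = t;
      	}
      
      	/* Start the timer or bring it forward; never push it back. */
      	if (soonest != now + MAX_JIFFY_OFFSET)
      		timer_reduce(&call->timer, soonest);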
      
      The disadvantage of doing things this way rather than comparing the timers
      each time and calling mod_timer() is that you *will* take timer events
      unless you can finish what you're doing and delete the timer in time.
      
      The advantage of doing things this way is that you don't need to use a lock
      to work out when the next timer should be set, other than the timer's own
      lock - which you might not have to take.
      
      [ tglx: Fixed weird formatting and adapted it to pending changes ]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keyrings@vger.kernel.org
      Cc: linux-afs@lists.infradead.org
      Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
    • pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday() · df27067e
      Arnd Bergmann committed
      __getnstimeofday() is a rather odd interface, with a number of quirks:
      
      - The caller may come from NMI context, but the implementation is not NMI
        safe. One way to get there from NMI is:
      
            NMI handler:
              something bad
                panic()
                  kmsg_dump()
                    pstore_dump()
                       pstore_record_init()
                         __getnstimeofday()
      
      - The calling conventions are different from those of any other
        timekeeping function, in order to return an error code during suspended
        timekeeping.
      
      Address the above issues by using a completely different method to get the
      time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
      when timekeeping is suspended: it returns the time at which it got
      suspended. As Thomas Gleixner explained, this is safe, as
      ktime_get_real_fast_ns() does not call into the clocksource driver that
      might be suspended.
      
      The result can easily be transformed into a timespec structure. Since
      ktime_get_real_fast_ns() was not exported to modules, add the export.
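      
      A minimal sketch of that conversion, assuming the record carries a
      timespec64 'time' field (a plain timespec with ns_to_timespec() works the
      same way):
      
      	record->time = ns_to_timespec64(ktime_get_real_fast_ns());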
      
      The pstore behavior for the suspended case changes slightly, as it now
      stores the timestamp at which timekeeping was suspended instead of storing
      a zero timestamp.
      
      This change is not addressing y2038-safety, that's subject to a more
      complex follow up patch.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Colin Cross <ccross@android.com>
      Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de
    • irq/work: Use llist_for_each_entry_safe · d00a08cf
      Thomas Gleixner committed
      The llist_for_each_entry() loop in irq_work_run_list() is unsafe because
      once the work's PENDING bit is cleared, the work can be requeued on
      another CPU.
      
      Use llist_for_each_entry_safe() instead.
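      
      A simplified sketch of the safe pattern (flag handling elided;
      clear_pending() is a hypothetical helper, not the exact
      irq_work_run_list() code):
      
      	struct irq_work *work, *tmp;
      	struct llist_node *llnode = llist_del_all(list);
      
      	llist_for_each_entry_safe(work, tmp, llnode, llnode) {
      		/*
      		 * The _safe variant caches the next node up front, so the
      		 * loop never dereferences 'work' again after its PENDING
      		 * bit is cleared and it can be requeued on another CPU.
      		 */
      		clear_pending(work);
      		work->func(work);
      	}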
      
      Fixes: 16c0890d ("irq/work: Don't reinvent the wheel but use existing llist API")
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petri Latvala <petri.latvala@intel.com>
      Link: http://lkml.kernel.org/r/151027307351.14762.4611888896020658384@mail.alporthouse.com
  4. 11 November 2017 (3 commits)
    • kthread: zero the kthread data structure · e10237cc
      Shaohua Li committed
      kthread() could bail out early before we initialize blkcg_css (if the
      kthread is killed very early; see the xchg() statement in kthread()),
      which confuses free_kthread_struct(). Instead of moving the blkcg_css
      initialization earlier, we simply zero the whole 'self' data structure,
      which doesn't add much overhead.
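      
      A minimal sketch of the idea, assuming the allocation happens in
      kthread() as described above:
      
      	/*
      	 * Zero the whole structure so that free_kthread_struct() sees a
      	 * NULL blkcg_css even if kthread() bails out before setting it.
      	 */
      	self = kzalloc(sizeof(*self), GFP_KERNEL);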
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Fixes: 05e3db95 ("kthread: add a mechanism to store cgroup info")
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blktrace: fix unlocked registration of tracepoints · a6da0024
      Jens Axboe committed
      We need to ensure that tracepoints are registered and unregistered in
      step with their users. The existing atomic count isn't enough for that.
      Add a lock around the tracepoints, so that access to them is
      serialized.
      
      This fixes cases where we have multiple users setting up and
      tearing down tracepoints, like this:
      
      CPU: 0 PID: 2995 Comm: syzkaller857118 Not tainted
      4.14.0-rc5-next-20171018+ #36
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:16 [inline]
        dump_stack+0x194/0x257 lib/dump_stack.c:52
        panic+0x1e4/0x41c kernel/panic.c:183
        __warn+0x1c4/0x1e0 kernel/panic.c:546
        report_bug+0x211/0x2d0 lib/bug.c:183
        fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177
        do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
        do_trap+0x260/0x390 arch/x86/kernel/traps.c:260
        do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297
        do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
        invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline]
      RIP: 0010:tracepoint_probe_register_prio+0x397/0x9a0 kernel/tracepoint.c:283
      RSP: 0018:ffff8801d1d1f6c0 EFLAGS: 00010293
      RAX: ffff8801d22e8540 RBX: 00000000ffffffef RCX: ffffffff81710f07
      RDX: 0000000000000000 RSI: ffffffff85b679c0 RDI: ffff8801d5f19818
      RBP: ffff8801d1d1f7c8 R08: ffffffff81710c10 R09: 0000000000000004
      R10: ffff8801d1d1f6b0 R11: 0000000000000003 R12: ffffffff817597f0
      R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8801d1d1f7a0
        tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:304
        register_trace_block_rq_insert include/trace/events/block.h:191 [inline]
        blk_register_tracepoints+0x1e/0x2f0 kernel/trace/blktrace.c:1043
        do_blk_trace_setup+0xa10/0xcf0 kernel/trace/blktrace.c:542
        blk_trace_setup+0xbd/0x180 kernel/trace/blktrace.c:564
        sg_ioctl+0xc71/0x2d90 drivers/scsi/sg.c:1089
        vfs_ioctl fs/ioctl.c:45 [inline]
        do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
        SYSC_ioctl fs/ioctl.c:700 [inline]
        SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
        entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x444339
      RSP: 002b:00007ffe05bb5b18 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00000000006d66c0 RCX: 0000000000444339
      RDX: 000000002084cf90 RSI: 00000000c0481273 RDI: 0000000000000009
      RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff
      R13: 00000000c0481273 R14: 0000000000000000 R15: 0000000000000000
      
      since these can now run in parallel. Ensure that the exported helpers
      for doing this grab the queue trace mutex.
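      
      A sketch of the serialized reference counting (helper and variable names
      assumed for illustration; the mutex replaces the bare atomic count):
      
      	static DEFINE_MUTEX(blk_probe_mutex);
      	static int blk_probes_ref;
      
      	static void get_probe_ref(void)
      	{
      		mutex_lock(&blk_probe_mutex);
      		if (++blk_probes_ref == 1)
      			blk_register_tracepoints();
      		mutex_unlock(&blk_probe_mutex);
      	}
      
      	static void put_probe_ref(void)
      	{
      		mutex_lock(&blk_probe_mutex);
      		if (!--blk_probes_ref)
      			blk_unregister_tracepoints();
      		mutex_unlock(&blk_probe_mutex);
      	}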
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blktrace: fix unlocked access to init/start-stop/teardown · 1f2cac10
      Jens Axboe committed
      sg.c calls into the blktrace functions without holding the proper queue
      mutex for doing setup, start/stop, or teardown.
      
      Add internal unlocked variants, and export the ones that do the proper
      locking.
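      
      A sketch of the split, assuming a queue-level blk_trace_mutex as the
      proper lock (teardown shown, body simplified; setup and start/stop follow
      the same pattern):
      
      	static int __blk_trace_remove(struct request_queue *q)
      	{
      		/* Caller must hold q->blk_trace_mutex. */
      		if (!q->blk_trace)
      			return -EINVAL;
      
      		blk_trace_cleanup(q->blk_trace);
      		q->blk_trace = NULL;
      		return 0;
      	}
      
      	int blk_trace_remove(struct request_queue *q)
      	{
      		int ret;
      
      		mutex_lock(&q->blk_trace_mutex);
      		ret = __blk_trace_remove(q);
      		mutex_unlock(&q->blk_trace_mutex);
      
      		return ret;
      	}
      	EXPORT_SYMBOL_GPL(blk_trace_remove);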
      
      Fixes: 6da127ad ("blktrace: Add blktrace ioctls to SCSI generic devices")
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 09 November 2017 (5 commits)
    • sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds · 765cc3a4
      Patrick Bellasi committed
      When the kernel is compiled without CONFIG_SCHED_DEBUG support, we expect
      all SCHED_FEAT flags to be turned into compile-time constants that can be
      propagated to support compiler optimizations.
      
      Specifically, we expect that code blocks like this:
      
         if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
      	/* FEATURE CODE */
         }
      
      are turned into dead code when FEATURE_NAME defaults to FALSE, and are
      thus removed by the compiler from the final image.
      
      For this mechanism to work properly, the compiler must have full
      visibility, in each translation unit, of the value used by the
      sched_feat() macro. This macro is defined as:
      
         #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
      
      and thus, the compiler can optimize that code only if the value of
      sysctl_sched_features is visible within each translation unit.
      
      Since:
      
         029632fb ("sched: Make separate sched*.c translation units")
      
      the scheduler code has been split into separate translation units
      however the definition of sysctl_sched_features is part of
      kernel/sched/core.c while, for all the other scheduler modules, it is
      visible only via kernel/sched/sched.h as an:
      
         extern const_debug unsigned int sysctl_sched_features
      
      Unfortunately, an extern reference does not allow the compiler to apply
      constant propagation. Thus, on a !CONFIG_SCHED_DEBUG kernel we still end
      up with code that loads a memory reference and (potentially) jumps over a
      chunk of code.
      
      This mechanism is unavoidable when sched_features can be turned on and off at
      run-time. However, this is not the case for "production" kernels compiled with
      !CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
      which cannot be changed at run-time and thus memory loads and jumps can be
      avoided altogether.
      
      This patch fixes the !CONFIG_SCHED_DEBUG case by declaring a file-local
      version of the sysctl_sched_features constant in each translation unit.
      This ultimately allows the compiler to perform constant propagation and
      dead-code pruning, as sketched below.
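      
      A simplified sketch of the construct (placed in a shared header so each
      translation unit gets its own file-local constant; const_debug and other
      annotations omitted):
      
      	/* Compose the constant from the defaults in features.h. */
      	#define SCHED_FEAT(name, enabled)	\
      		(1UL << __SCHED_FEAT_##name) * enabled |
      	static const unsigned int sysctl_sched_features =
      	#include "features.h"
      		0;
      	#undef SCHED_FEAT
      
      	#define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))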
      
      Tests were done with !CONFIG_SCHED_DEBUG on v4.14-rc8, with and without
      the patch, by running 30 iterations of:
      
         perf bench sched messaging --pipe --thread --group 4 --loop 50000
      
      on a 40-core Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, using the
      powersave governor to rule out variations due to frequency scaling.
      
      Statistics on the reported completion time:
      
                         count     mean       std     min       99%     max
        v4.14-rc8         30.0  15.7831  0.176032  15.442  16.01226  16.014
        v4.14-rc8+patch   30.0  15.5033  0.189681  15.232  15.93938  15.962
      
      ... show a 1.8% speedup in average completion time and a 0.5% speedup at
      the 99th percentile.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Chris Redpath <chris.redpath@arm.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq · 07458f6a
      Viresh Kumar committed
      'cached_raw_freq' is used to get the next frequency quickly, but it
      should always be in sync with sg_policy->next_freq. There is a case where
      it is not, and in such cases it should be reset to avoid switching to
      incorrect frequencies.
      
      Consider this case for example:
      
       - policy->cur is 1.2 GHz (Max)
       - New request comes for 780 MHz and we store that in cached_raw_freq.
       - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
       - We then see the CPU wasn't idle recently and choose to keep the next
         freq as 1.2 GHz.
       - Now cached_raw_freq is 780 MHz while sg_policy->next_freq is
         1.2 GHz.
       - If the utilization doesn't change on the next request, the next
         target frequency will still be 780 MHz and will match
         cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
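      
      A minimal sketch of the fix in that busy-CPU path (simplified):
      
      	if (busy && next_f < sg_policy->next_freq) {
      		next_f = sg_policy->next_freq;
      
      		/* Reset cached_raw_freq as next_freq has changed. */
      		sg_policy->cached_raw_freq = 0;
      	}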
      
      Fixes: b7eaf1aa (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • PM / s2idle: Clear the events_check_enabled flag · 95b982b4
      Rajat Jain committed
      Problem: the events_check_enabled flag currently does not get cleared in
      the suspend or resume path in the following cases:
      
       * some driver's suspend routine returns an error;
       * a successful s2idle suspend;
       * possibly others.
      
      Why is this a problem: the next suspend attempt could fail even though
      the user did not enable the flag by writing to /sys/power/wakeup_count.
      Here is one use case in which the issue can be seen (a similar one
      involving a driver suspend failure can also be constructed):
      
       1. Read /sys/power/wakeup_count
       2. echo count > /sys/power/wakeup_count
       3. echo freeze > /sys/power/state
       4. Let the system suspend, and wake it up using some wake source
          that calls pm_wakeup_event(), e.g. the power button.
       5. Note that the combined wakeup count will be incremented due
          to the pm_wakeup_event() in the resume path.
       6. After resuming, the events_check_enabled flag is still set.
      
      At this point, if the user attempts to freeze again (without writing to
      /sys/power/wakeup_count), the suspend will fail even though there has
      been no wake event since the last resume.
      
      Address that by clearing the flag just before a resume is completed,
      so that it is always cleared for the corner cases mentioned above.
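      
      A minimal sketch of the idea (the exact placement in the resume path is
      illustrative only):
      
      	/* Resume has completed: drop any stale arming of the wakeup
      	 * count so the next suspend attempt starts clean. */
      	events_check_enabled = false;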
      Signed-off-by: Rajat Jain <rajatja@google.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • stop using '%pK' for /proc/kallsyms pointer values · c0f3ea15
      Linus Torvalds committed
      Not only is it annoying to have one single flag for all pointers, as if
      that was a global choice and all kernel pointers are the same, but %pK
      can't get the 'access' vs 'open' time check right anyway.
      
      So make the /proc/kallsyms pointer value code use logic specific to that
      particular file.  We do continue to honor kptr_restrict, but the default
      (which is unrestricted) is changed to instead take expected users into
      account, and restrict access by default.
      
      Right now the only actual expected user is kernel profiling, which has a
      separate sysctl flag for kernel profile access.  There may be others.
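      
      A sketch of such file-specific logic (kallsyms_for_perf() standing in for
      the kernel-profiling sysctl check; the exact policy shown is
      illustrative):
      
      	static int kallsyms_show_value(void)
      	{
      		switch (kptr_restrict) {
      		case 0:
      			if (kallsyms_for_perf())
      				return 1;
      			/* fallthrough */
      		case 1:
      			if (has_capability_noaudit(current, CAP_SYSLOG))
      				return 1;
      			/* fallthrough */
      		default:
      			return 0;
      		}
      	}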
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • module: export module signature enforcement status · fda784e5
      Bruno E. O. Meneguele committed
      A static variable, sig_enforce, is used as a status variable to reflect
      the real value of CONFIG_MODULE_SIG_FORCE: once that option is set, the
      variable holds true; if the option is not set, the variable holds
      whatever value is given by the module.sig_enforce kernel cmdline
      parameter: true when =1, and false when =0 or not present.
      
      Since this cmdline parameter takes precedence over the CONFIG value when
      the latter is not set, other places in the kernel could misbehave, as
      they would have only the CONFIG_MODULE_SIG_FORCE value to rely on.
      Exporting this status variable allows the kernel to rely on the effective
      value of module signature enforcement, whether it comes from the CONFIG
      value or the cmdline parameter, as sketched below.
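      
      A minimal sketch of the exported accessor (the name is illustrative):
      
      	bool is_module_sig_enforced(void)
      	{
      		return sig_enforce;
      	}
      	EXPORT_SYMBOL(is_module_sig_enforced);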
      Signed-off-by: Bruno E. O. Meneguele <brdeoliv@redhat.com>
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
  6. 08 November 2017 (12 commits)
  7. 07 November 2017 (5 commits)
  8. 05 November 2017 (1 commit)
  9. 03 November 2017 (1 commit)
    • arm64/sve: Add prctl controls for userspace vector length management · 2d2123bc
      Dave Martin committed
      This patch adds two arm64-specific prctls, to permit userspace to
      control its vector length:
      
       * PR_SVE_SET_VL: set the thread's SVE vector length and vector
         length inheritance mode.
      
       * PR_SVE_GET_VL: get the same information.
      
      Although these prctls resemble instruction set features in the SVE
      architecture, they provide additional control: the vector length
      inheritance mode is Linux-specific and has nothing to do with the
      architecture, and the architecture does not permit EL0 to set its
      own vector length directly. Both can be used in portable tools
      without requiring the use of SVE instructions.
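      
      A hypothetical userspace usage sketch, assuming the libc headers expose
      the new constants:
      
      	#include <sys/prctl.h>
      
      	/* Request 512-bit (64-byte) vectors and let children inherit
      	 * the setting; returns the configured vl and flags on success. */
      	int ret = prctl(PR_SVE_SET_VL, 64 | PR_SVE_VL_INHERIT);
      
      	/* Query the current vector length and inheritance flag. */
      	int vl = prctl(PR_SVE_GET_VL);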
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alex Bennée <alex.bennee@linaro.org>
      [will: Fixed up prctl constants to avoid clash with PDEATHSIG]
      Signed-off-by: Will Deacon <will.deacon@arm.com>