1. 18 Mar, 2016 2 commits
  2. 16 Mar, 2016 4 commits
    • A
      kallsyms: add support for relative offsets in kallsyms address table · 2213e9a6
      Ard Biesheuvel authored
      Similar to how relative extables are implemented, it is possible to emit
      the kallsyms table in such a way that it contains offsets relative to
      some anchor point in the kernel image rather than absolute addresses.
      
      On 64-bit architectures, it cuts the size of the kallsyms address table
      in half, since offsets between kernel symbols can typically be expressed
      in 32 bits.  This saves several hundreds of kilobytes of permanent
      .rodata on average.  In addition, the kallsyms address table is no
      longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
      effect, so the relocation work done after decompression now doesn't have
      to do relocation updates for all these values.  This saves up to 24
      bytes (i.e., the size of an ELF64 RELA relocation table entry) per value,
      which easily adds up to a couple of megabytes of uncompressed __init
      data on ppc64 or arm64.  Even if these relocation entries typically
      compress well, the combined size reduction of 2.8 MB uncompressed for a
      ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
      KB space saving in the compressed image.
      
      Since it is useful for some architectures (like x86) to retain the
      ability to emit absolute values as well, this patch also adds support
      for capturing both absolute and relative values when
      KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
      addresses as positive 32-bit values, and addresses relative to the
      lowest encountered relative symbol as negative values, which are
      subtracted from the runtime address of this base symbol to produce the
      actual address.
      
      Support for the above is enabled by default for all architectures except
      IA-64 and Tile-GX, whose symbols are too far apart to capture in this
      manner.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
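The encoding described in the kallsyms commit message can be sketched in a few lines of C. This is a minimal illustration, not the kernel's code: the names `sketch_encode_relative()` and `sketch_decode()` are hypothetical, and the bias of one applied to relative entries (so that the base symbol itself encodes as -1 rather than 0) is an assumption of this sketch that keeps negative entries distinguishable from absolute ones.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical encoder: store a symbol at or above 'relative_base' as a
 * negative 32-bit value. The extra -1 is an assumption of this sketch so
 * that the base symbol itself does not encode as 0.
 */
static int32_t sketch_encode_relative(uint64_t relative_base, uint64_t addr)
{
    uint64_t off = addr - relative_base;   /* assumed to fit in 31 bits */

    return (int32_t)(-(int64_t)off - 1);
}

/*
 * Hypothetical decoder for one 32-bit table entry.
 * absolute_percpu == 0: every entry is an unsigned offset from the base.
 * absolute_percpu != 0: non-negative entries are absolute (per-cpu)
 * values, negative entries are relative to the base.
 */
static uint64_t sketch_decode(uint64_t relative_base, int32_t entry,
                              int absolute_percpu)
{
    if (!absolute_percpu)
        return relative_base + (uint32_t)entry;

    if (entry >= 0)
        return (uint64_t)entry;

    return relative_base - 1 - (int64_t)entry;
}
```

Each entry occupies 4 bytes instead of 8, which is where the halving of the address table on 64-bit architectures comes from, and because the entries are plain offsets they need no relocation when the image is moved.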
    • L
      mm/page_poisoning.c: allow for zero poisoning · 1414c7f4
      Laura Abbott authored
      By default, page poisoning uses a poison value (0xaa) on free.  If this
      is changed to 0, the page is not only sanitized but zeroing on alloc
      with __GFP_ZERO can be skipped as well.  The tradeoff is that corruption
      is harder to detect with a zero poison value.  This feature also
      cannot be used with hibernation since pages are not guaranteed to be
      zeroed after hibernation.
      
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
      Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: fix two typos in comments for to_vmem_altmap() · 07061aab
      Andreas Ziegler authored
      Commit 4b94ffdc ("x86, mm: introduce vmem_altmap to augment
      vmemmap_populate()") introduced the to_vmem_altmap() function.
      
      The comments in this function contain two typos (one misspelling of the
      Kconfig option CONFIG_SPARSEMEM_VMEMMAP, and one missing letter 'n');
      let's fix them up.
      Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      tags: Fix DEFINE_PER_CPU expansions · 25528213
      Peter Zijlstra authored
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      Which are all the result of the DEFINE_PER_CPU pattern:
      
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      
      The below cures them. All except the workqueue one are within reasonable
      distance of the 80 char limit. TJ, do you have any preference on how to
      fix the wq one, or shall we just not care that it's too long?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 13 Mar, 2016 1 commit
  4. 11 Mar, 2016 2 commits
    • T
      cpu/hotplug: Fix smpboot thread ordering · 2a58c527
      Thomas Gleixner authored
      Commit 931ef163 moved the smpboot thread park/unpark invocation to the
      state machine. The move of the unpark invocation was premature as it depends
      on work in progress patches.
      
      As a result cpu down can fail, because rcu synchronization in takedown_cpu()
      eventually requires a functional softirq thread. I never encountered the
      problem in testing, but 0day testing managed to provide a reliable reproducer.
      
      Remove the smpboot_threads_park() call from the state machine for now and put
      it back into the original place after the rcu synchronization.
      
      I'm embarrassed as I knew about the dependency and still managed to get it
      wrong. Hotplug induced brain melt seems to be the only sensible explanation
      for that.
      
      Fixes: 931ef163 "cpu/hotplug: Unpark smpboot threads from the state machine"
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • R
      cpufreq: Move scheduler-related code to the sched directory · adaf9fcd
      Rafael J. Wysocki authored
      Create cpufreq.c under kernel/sched/ and move the cpufreq code
      related to the scheduler to that file and to sched.h.
      
      Redefine cpufreq_update_util() as a static inline function to avoid
      function calls at its call sites in the scheduler code (as suggested
      by Peter Zijlstra).
      
      Also move the definition of struct update_util_data and declaration
      of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
      include/linux/sched.h.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  5. 10 Mar, 2016 10 commits
  6. 09 Mar, 2016 3 commits
    • R
      cpufreq: Add mechanism for registering utilization update callbacks · 34e2c555
      Rafael J. Wysocki authored
      Introduce a mechanism by which parts of the cpufreq subsystem
      ("setpolicy" drivers or the core) can register callbacks to be
      executed from cpufreq_update_util() which is invoked by the
      scheduler's update_load_avg() on CPU utilization changes.
      
      This allows the "setpolicy" drivers to dispense with their timers
      and do all of the computations they need and frequency/voltage
      adjustments in the update_load_avg() code path, among other things.
      
      The update_load_avg() changes were suggested by Peter Zijlstra.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
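The callback mechanism described in the cpufreq commit above can be pictured as a per-CPU hook pointer that the scheduler path calls whenever one is registered. The sketch below is a user-space illustration only: every `sketch_` name is hypothetical, the fixed-size array stands in for real per-CPU data, and the callback argument list (time, util, max) is an assumption rather than a quote from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4   /* assumption: small fixed CPU count */

/* Hypothetical hook object; 'last_util' exists only for this demo. */
struct sketch_update_util_data {
    void (*func)(struct sketch_update_util_data *data, uint64_t time,
                 unsigned long util, unsigned long max);
    unsigned long last_util;
};

/* One hook slot per CPU; NULL means nothing is registered. */
static struct sketch_update_util_data *sketch_hooks[SKETCH_NR_CPUS];

/* Register (or, with NULL, unregister) the hook for one CPU. */
static void sketch_set_update_util_data(int cpu,
                                        struct sketch_update_util_data *data)
{
    sketch_hooks[cpu] = data;
}

/* Called from the utilization-update path; cheap when no hook is set. */
static void sketch_update_util(int cpu, uint64_t time,
                               unsigned long util, unsigned long max)
{
    struct sketch_update_util_data *data = sketch_hooks[cpu];

    if (data)
        data->func(data, time, util, max);
}

/* Example callback: remember the most recent utilization value. */
static void sketch_record(struct sketch_update_util_data *data, uint64_t time,
                          unsigned long util, unsigned long max)
{
    (void)time;
    (void)max;
    data->last_util = util;
}
```

A driver that used to poll from a timer would instead register something like `sketch_record()` and make its frequency decision inside the callback.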
    • C
      x86/ACPI/PCI: Recognize that Interrupt Line 255 means "not connected" · e237a551
      Chen Fan authored
      Per the x86-specific footnote to PCI spec r3.0, sec 6.2.4, the value 255 in
      the Interrupt Line register means "unknown" or "no connection."
      Previously, when we couldn't derive an IRQ from the _PRT, we fell back to
      using the value from Interrupt Line as an IRQ.  It's questionable whether
      we should do that at all, but the spec clearly suggests we shouldn't do it
      for the value 255 on x86.
      
      Calling request_irq() with IRQ 255 may succeed, but the driver won't
      receive any interrupts.  Or, if IRQ 255 is shared with another device, it
      may succeed, and the driver's ISR will be called at random times when the
      *other* device interrupts.  Or it may fail if another device is using IRQ
      255 with incompatible flags.  What we *want* is for request_irq() to fail
      predictably so the driver can fall back to polling.
      
      On x86, assume 255 in the Interrupt Line means the INTx line is not
      connected.  In that case, set dev->irq to IRQ_NOTCONNECTED so request_irq()
      will fail gracefully with -ENOTCONN.
      
      We found this problem on a system where Secure Boot firmware assigned
      Interrupt Line 255 to an i801_smbus device and another device was already
      using MSI-X IRQ 255.  This was in v3.10, where i801_probe() fails if
      request_irq() fails:
      
        i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
        i801_smbus 0000:00:1f.3: can't derive routing for PCI INT C
        i801_smbus 0000:00:1f.3: PCI INT C: no GSI
        genirq: Flags mismatch irq 255. 00000080 (i801_smbus) vs. 00000000 (megasa)
        CPU: 0 PID: 2487 Comm: kworker/0:1 Not tainted 3.10.0-229.el7.x86_64 #1
        Hardware name: FUJITSU PRIMEQUEST 2800E2/D3736, BIOS PRIMEQUEST 2000 Serie5
        Call Trace:
          dump_stack+0x19/0x1b
          __setup_irq+0x54a/0x570
          request_threaded_irq+0xcc/0x170
          i801_probe+0x32f/0x508 [i2c_i801]
          local_pci_probe+0x45/0xa0
        i801_smbus 0000:00:1f.3: Failed to allocate irq 255: -16
        i801_smbus: probe of 0000:00:1f.3 failed with error -16
      
      After aeb8a3d1 ("i2c: i801: Check if interrupts are disabled"),
      i801_probe() will fall back to polling if request_irq() fails.  But we
      still need this patch because request_irq() may succeed or fail depending
      on other devices in the system.  If request_irq() fails, i801_smbus will
      work by falling back to polling, but if it succeeds, i801_smbus won't work
      because it expects interrupts that it may not receive.
      Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
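The behaviour the Interrupt Line commit asks for, a predictable failure instead of a request that may or may not succeed, can be sketched as follows. The functions and the `SKETCH_` constant are hypothetical stand-ins; the sentinel value chosen here is an assumption, and only the names IRQ_NOTCONNECTED and -ENOTCONN come from the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: a sentinel that can never be a valid IRQ number. */
#define SKETCH_IRQ_NOTCONNECTED (1U << 31)

/* Map the 8-bit Interrupt Line register to an IRQ, treating 255 as "none". */
static unsigned int sketch_irq_from_interrupt_line(unsigned char line)
{
    if (line == 0xff)
        return SKETCH_IRQ_NOTCONNECTED;
    return line;
}

/* Stand-in for request_irq(): reject the sentinel up front. */
static int sketch_request_irq(unsigned int irq)
{
    if (irq == SKETCH_IRQ_NOTCONNECTED)
        return -ENOTCONN;
    return 0;   /* a real implementation would set up the handler here */
}
```

With this in place a driver sees the same error on every system and can fall back to polling, rather than depending on whether some other device happens to own IRQ 255.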
    • J
      futex: Replace barrier() in unqueue_me() with READ_ONCE() · 29b75eb2
      Jianyu Zhan authored
      Commit e91467ec ("bug in futex unqueue_me") introduced a barrier() in
      unqueue_me() to prevent the compiler from rereading the lock pointer which
      might change after a check for NULL.
      
      Replace the barrier() with a READ_ONCE() for the following reasons:
      
      1) READ_ONCE() is a weaker form of barrier() that affects only the specific
         load operation, while barrier() is a general compiler level memory barrier.
         READ_ONCE() was not available at the time when the barrier was added.
      
      2) Aside from that, READ_ONCE() is descriptive and self-explanatory, while a
         barrier without a comment is not clear to the casual reader.
      
      No functional change.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Darren Hart <dvhart@linux.intel.com>
      Cc: dave@stgolabs.net
      Cc: peterz@infradead.org
      Cc: linux@rasmusvillemoes.dk
      Cc: akpm@linux-foundation.org
      Cc: fengguang.wu@intel.com
      Cc: bigeasy@linutronix.de
      Link: http://lkml.kernel.org/r/1457314344-5685-1-git-send-email-nasa4836@gmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
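The pattern behind the futex change above is "load the pointer exactly once, then only use the local copy". The sketch below shows it in user-space C under stated assumptions: `SKETCH_READ_ONCE()` is a simplified volatile-cast macro (it relies on the GNU `__typeof__` extension and is not the kernel's definition), and the `sketch_` structures and function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: force a single load through a volatile access. */
#define SKETCH_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct sketch_lock { int held; };
struct sketch_waiter { struct sketch_lock *lock_ptr; };

/* Returns 1 if the waiter was still queued, 0 if it was already removed. */
static int sketch_unqueue(struct sketch_waiter *q)
{
    /* One load; the compiler may not re-read q->lock_ptr after the check. */
    struct sketch_lock *lock = SKETCH_READ_ONCE(q->lock_ptr);

    if (lock == NULL)
        return 0;

    lock->held = 0;        /* always use the local copy from here on */
    q->lock_ptr = NULL;
    return 1;
}
```

Unlike a bare barrier(), the macro constrains only this one load and documents at the point of use which access is the sensitive one.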
  7. 08 Mar, 2016 4 commits
    • C
      sched/cputime: Fix steal_account_process_tick() to always return jiffies · f9c904b7
      Chris Friesen authored
      The callers of steal_account_process_tick() expect it to return
      whether a jiffy should be considered stolen or not.
      
      Currently the return value of steal_account_process_tick() is in
      units of cputime, which is either jiffies or nsecs
      depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.
      
      If cputime has nsecs granularity and there is a tiny amount of
      stolen time (a few nsecs, say) then we will consider the entire
      tick stolen and will not account the tick on user/system/idle,
      causing /proc/stat to show invalid data.
      
      The fix is to change steal_account_process_tick() to accumulate
      the stolen time and only account it once it's worth a jiffy.
      
      (Thanks to Frederic Weisbecker for suggestions to fix a bug in my
      first version of the patch.)
      Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
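The fix described in the steal-time commit, accumulating stolen time and only reporting it once it amounts to a whole jiffy, can be sketched like this. All names are hypothetical, and the jiffy length assumes HZ=250 purely for the sake of round numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: HZ=250, so one jiffy is 4 ms. */
#define SKETCH_NSEC_PER_JIFFY 4000000ULL

struct sketch_steal_state {
    uint64_t prev_steal_ns;   /* stolen time already accounted, in ns */
};

/*
 * 'steal_clock_ns' is the hypervisor's running total of stolen time.
 * Returns the number of whole jiffies to account as stolen; any
 * remainder stays pending and is picked up by a later call.
 */
static uint64_t sketch_steal_jiffies(struct sketch_steal_state *s,
                                     uint64_t steal_clock_ns)
{
    uint64_t pending = steal_clock_ns - s->prev_steal_ns;
    uint64_t whole = pending / SKETCH_NSEC_PER_JIFFY;

    /* Advance only by what was reported, keeping the sub-jiffy rest. */
    s->prev_steal_ns += whole * SKETCH_NSEC_PER_JIFFY;
    return whole;
}
```

A few nanoseconds of steal now yield a return value of 0, so the tick is still charged to user, system, or idle time as usual.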
    • L
      sched/deadline: Remove dl_new from struct sched_dl_entity · 72f9f3fd
      Luca Abeni authored
      The dl_new field of struct sched_dl_entity is currently used to
      identify new deadline tasks, so that their deadline and runtime
      can be properly initialised.
      
      However, these tasks can be easily identified by checking if
      their deadline is smaller than the current time when they switch
      to SCHED_DEADLINE. So, dl_new can be removed by introducing this
      check in switched_to_dl(); this simplifies the
      SCHED_DEADLINE code.
      Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1457350024-7825-2-git-send-email-luca.abeni@unitn.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
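The check that replaces the dl_new flag can be illustrated with a small sketch: a task whose absolute deadline already lies in the past when it switches to SCHED_DEADLINE is treated as new and gets a fresh deadline and budget. The structure and function names below are hypothetical simplifications, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, pared-down deadline entity. */
struct sketch_dl_entity {
    uint64_t dl_runtime;    /* configured budget per period */
    uint64_t dl_deadline;   /* configured relative deadline */
    uint64_t runtime;       /* remaining budget */
    uint64_t deadline;      /* current absolute deadline */
};

/* Wrap-safe "a is earlier than b" comparison on 64-bit timestamps. */
static int sketch_time_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Called when a task switches to the deadline class at time 'now'. */
static void sketch_switched_to_dl(struct sketch_dl_entity *se, uint64_t now)
{
    /* A deadline in the past marks the entity as new; no flag needed. */
    if (sketch_time_before(se->deadline, now)) {
        se->deadline = now + se->dl_deadline;
        se->runtime = se->dl_runtime;
    }
}
```

An entity that still holds a future deadline is left untouched, which is the case the separate flag used to protect.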
    • A
      perf/core: Fix perf_sched_count derailment · 927a5570
      Alexander Shishkin authored
      The error path in perf_event_open() is such that asking for a sampling
      event on a PMU that doesn't generate interrupts will end up in dropping
      the perf_sched_count even though it hasn't been incremented for this
      event yet.
      
      Given a sufficient amount of these calls, we'll end up disabling
      scheduler's jump label even though we'd still have active events in the
      system, thereby facilitating the arrival of the infernal regions upon us.
      
      I'm fixing this by moving account_event() inside perf_event_alloc().
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • I
      time/timekeeping: Work around false positive GCC warning · 6436257b
      Ingo Molnar authored
      Newer GCC versions trigger the following warning:
      
        kernel/time/timekeeping.c: In function ‘get_device_system_crosststamp’:
        kernel/time/timekeeping.c:987:5: warning: ‘clock_was_set_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
          if (discontinuity) {
           ^
        kernel/time/timekeeping.c:1045:15: note: ‘clock_was_set_seq’ was declared here
          unsigned int clock_was_set_seq;
                       ^
      
      GCC clearly is unable to recognize that the 'do_interp' boolean tracks
      the initialization status of 'clock_was_set_seq'.
      
      The GCC version used was:
      
        gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC)
      
      Work around it by initializing clock_was_set_seq to 0. Compilers that
      are able to recognize the code flow will eliminate the unnecessary
      initialization.
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
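The shape of code that provokes this kind of -Wmaybe-uninitialized report can be reduced to a variable that is written and read under the same boolean. The function below is a hypothetical reduction, not the timekeeping code; whether a given compiler actually warns without the initializer depends on its version and optimization level.

```c
#include <assert.h>

/*
 * 'seq' is only assigned and only consumed when 'do_interp' is true,
 * but a compiler that cannot correlate the two tests may still warn.
 */
static unsigned int sketch_next_seq(int do_interp, unsigned int snapshot_seq)
{
    unsigned int seq = 0;   /* redundant on paper; silences the warning */

    if (do_interp)
        seq = snapshot_seq;

    /* ... unrelated work would sit here in real code ... */

    if (do_interp)
        return seq + 1;
    return 0;
}
```

The initializer costs nothing when the compiler can follow the flow, since the dead store is then removed.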
  8. 06 Mar, 2016 1 commit
  9. 04 Mar, 2016 1 commit
    • S
      tracing: Do not have 'comm' filter override event 'comm' field · e57cbaf0
      Steven Rostedt (Red Hat) authored
      Commit 9f616680 "tracing: Allow triggers to filter for CPU ids and
      process names" added a 'comm' filter that will filter events based on the
      current task struct's 'comm'. But this now hides the ability to filter events
      that have a 'comm' field too. For example, the sched_migrate_task trace event
      has a 'comm' field of the task to be migrated.
      
       echo 'comm == "bash"' > events/sched_migrate_task/filter
      
      will now filter all sched_migrate_task events for tasks named "bash" that
      migrate other tasks (in interrupt context), instead of seeing when "bash"
      itself gets migrated.
      
      This fix requires a couple of changes.
      
      1) Change the look up order for filter predicates to look at the event's
         fields before looking at the generic filters.
      
      2) Instead of basing the filter function off of the "comm" name, have the
         generic "comm" filter have its own filter_type (FILTER_COMM). Test
         against the type instead of the name to assign the filter function.
      
      3) Add a new "COMM" filter that works just like "comm" but will filter based
         on the current task, even if the trace event contains a "comm" field.
      
      Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".
      
      Cc: stable@vger.kernel.org # v4.3+
      Fixes: 9f616680 "tracing: Allow triggers to filter for CPU ids and process names"
      Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  10. 03 Mar, 2016 7 commits
    • T
      hrtimer: Revert CLOCK_MONOTONIC_RAW support · 82e88ff1
      Thomas Gleixner authored
      Revert commits:
      a6e707dd: KVM: arm/arm64: timer: Switch to CLOCK_MONOTONIC_RAW
      9006a018: hrtimer: Catch illegal clockids
      9c808765: hrtimer: Add support for CLOCK_MONOTONIC_RAW
      
      Marc found out that there are fundamental issues with that patch series
      because __hrtimer_get_next_event() and hrtimer_forward() need support for
      CLOCK_MONOTONIC_RAW. This is not easily fixed, so revert the whole lot.
      Reported-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/56D6CEF0.8060607@arm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • T
      cpu/hotplug: Plug death reporting race · 71f87b2f
      Thomas Gleixner authored
      Paul noticed that the conversion of the death reporting introduced a race
      where the outgoing cpu might be delayed after waking the control processor,
      so it might not be able to call rcu_report_dead() before being physically
      removed, leading to RCU stalls.
      
      We can't call complete() after rcu_report_dead(), so instead of going back to
      busy polling, simply issue a function call to do the completion.
      
      Fixes: 27d50c7e "rcu: Make CPU_DYING_IDLE an explicit call"
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/20160302201127.GA23440@linux.vnet.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
    • C
      time: Add history to cross timestamp interface supporting slower devices · 2c756feb
      Christopher S. Hall authored
      Another representative use case of time sync and the correlated
      clocksource (in addition to PTP noted above) is PTP synchronized
      audio.
      
      In a streaming application, as an example, samples will be sent and/or
      received by multiple devices with a presentation time that is in terms
      of the PTP master clock. Synchronizing the audio output on these
      devices requires correlating the audio clock with the PTP master
      clock. The more precise this correlation is, the better the audio
      quality (i.e. out of sync audio sounds bad).
      
      From an application standpoint, to correlate the PTP master clock with
      the audio device clock, the system clock is used as an intermediate
      timebase. The transforms such an application would perform are:
      
          System Clock <-> Audio clock
          System Clock <-> Network Device Clock [<-> PTP Master Clock]
      
      Modern Intel platforms can perform a more accurate cross timestamp in
      hardware (ART, audio device clock).  The audio driver requires
      ART->system time transforms -- the same as required for the network
      driver. These platforms offload audio processing (including
      cross-timestamps) to a DSP which, to ensure uninterrupted audio
      processing, communicates with and responds to the host only once every
      millisecond. As a result, it takes up to a millisecond for the DSP to
      receive a request; the request is then processed by the DSP, the audio
      output hardware is polled for completion, the result is copied into
      shared memory, and the host is notified. All of these operations occur
      on a millisecond cadence.  This transaction requires about 2 ms, but
      under heavier workloads it may take up to 4 ms.
      
      Adding a history allows these slow devices the option of providing an
      ART value outside of the current interval. In this case, the callback
      provided is an accessor function for the previously obtained counter
      value. If get_device_system_crosststamp() receives a counter value
      previous to cycle_last, it consults the history provided as an
      argument in history_ref and interpolates the realtime and monotonic
      raw system time using the provided counter value. If there are any
      clock discontinuities, e.g. from calling settimeofday(), the monotonic
      raw time is interpolated in the usual way, but the realtime clock time
      is adjusted by scaling the monotonic raw adjustment.
      
      When an accessor function is used a history argument *must* be
      provided. The history is initialized using ktime_get_snapshot(), which
      must be called before the counter values are read.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
      [jstultz: Fixed up cycles_t/cycle_t type confusion]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
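The idea of mapping a counter value that predates the current timekeeping interval back to a system time can be pictured as linear interpolation between a history snapshot and the current point. The sketch below is only that picture: the function name and parameters are hypothetical, it is not the kernel's algorithm, and it ignores the clock-discontinuity handling the commit message describes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Interpolate the time that corresponds to 'cycles', which is assumed
 * to lie between the history snapshot (hist_cycles, hist_ns) and the
 * current point (cur_cycles, cur_ns).
 * Note: the multiplication can overflow for very large spans; a real
 * implementation would need wider arithmetic.
 */
static uint64_t sketch_interpolate_ns(uint64_t hist_cycles, uint64_t hist_ns,
                                      uint64_t cur_cycles, uint64_t cur_ns,
                                      uint64_t cycles)
{
    uint64_t total = cur_cycles - hist_cycles;
    uint64_t partial = cycles - hist_cycles;

    if (total == 0)
        return hist_ns;   /* degenerate interval: nothing to scale */

    return hist_ns + (cur_ns - hist_ns) * partial / total;
}
```

For a slow device such as the audio DSP, the snapshot would be taken before the request is issued, so the counter value that comes back a few milliseconds later still falls inside the covered span.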
    • C
      time: Add driver cross timestamp interface for higher precision time synchronization · 8006c245
      Christopher S. Hall authored
      ACKNOWLEDGMENT: cross timestamp code was developed by Thomas Gleixner
      <tglx@linutronix.de>. It has changed considerably and any mistakes are
      mine.
      
      The precision with which events on multiple networked systems can be
      synchronized using, as an example, PTP (IEEE 1588, 802.1AS) is limited
      by the precision of the cross timestamps between the system clock and
      the device (timestamp) clock. Precision here is the degree of
      simultaneity when capturing the cross timestamp.
      
      Currently the PTP cross timestamp is captured in software using the
      PTP device driver ioctl PTP_SYS_OFFSET. Reads of the device clock are
      interleaved with reads of the realtime clock. At best, the precision
      of this cross timestamp is on the order of several microseconds due to
      software latencies. Sub-microsecond precision is required for
      industrial control and some media applications. To achieve this level
      of precision hardware supported cross timestamping is needed.
      
      The function get_device_system_crosststamp() allows device drivers
      to return a cross timestamp with system time properly scaled to
      nanoseconds.  The realtime value is needed to discipline that clock
      using PTP and the monotonic raw value is used for applications that
      don't require a "real" time, but need an unadjusted clock time.  The
      get_device_system_crosststamp() code calls back into the driver to
      ensure that the system counter is within the current timekeeping
      update interval.
      
      Modern Intel hardware provides an Always Running Timer (ART) which is
      exactly related to TSC through a known frequency ratio. The ART is
      routed to devices on the system and is used to precisely and
      simultaneously capture the device clock with the ART.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
      [jstultz: Reworked to remove extra structures and simplify calling]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • C
      time: Remove duplicated code in ktime_get_raw_and_real() · ba26621e
      Christopher S. Hall authored
      The code in ktime_get_snapshot() is a superset of the code in
      ktime_get_raw_and_real(). Further, ktime_get_raw_and_real() is
      called only by the PPS code, pps_get_ts(). Consolidate the
      pps_get_ts() code into a single function calling ktime_get_snapshot()
      and eliminate ktime_get_raw_and_real(). A side effect of this is that
      the raw and real results of pps_get_ts() correspond to exactly the
      same clock cycle. Previously these values represented separate reads
      of the system clock.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • C
      time: Add timekeeping snapshot code capturing system time and counter · 9da0f49c
      Christopher S. Hall authored
      In the current timekeeping code there isn't any interface to
      atomically capture the current relationship between the system counter
      and system time. ktime_get_snapshot() returns this triple (counter,
      monotonic raw, realtime) in the system_time_snapshot struct.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
      [jstultz: Moved structure definitions around to clean things up,
       fixed cycles_t/cycle_t confusion.]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • C
      time: Add cycles to nanoseconds translation · 6bd58f09
      Christopher S. Hall authored
      The timekeeping code does not currently provide a way to translate
      externally provided clocksource cycles to system time. The cycle count
      is always provided by the result of the clocksource read() method internal to
      the timekeeping code. The added function timekeeping_cycles_to_ns()
      calculates a nanosecond value from a cycle count that can be added to
      the tk_read_base.base value, yielding the current system time. This allows
      clocksource cycle values external to the timekeeping code to provide a
      cycle count that can be transformed to system time.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Christopher S. Hall <christopher.s.hall@intel.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      6bd58f09
  11. 02 Mar, 2016 5 commits
    • F
      sched-clock: Migrate to use new tick dependency mask model · 4f49b90a
      Frederic Weisbecker committed
      Instead of checking sched_clock_stable from the nohz subsystem to verify
      its tick dependency, migrate it to the new mask so that it is included
      in the all-in-one check.
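
      The idea of an all-in-one check can be sketched in user-space C as a
      single bitmask in which each subsystem sets or clears its own bit, so
      that the decision to stop the tick reduces to testing the mask against
      zero. All names below (sketch_dep_bit, sketch_dep_set, and so on) are
      hypothetical and chosen for this example only; this is not the
      kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical dependency bits, one per subsystem that may need the tick. */
enum sketch_dep_bit {
	SKETCH_DEP_BIT_POSIX_TIMER,
	SKETCH_DEP_BIT_PERF_EVENTS,
	SKETCH_DEP_BIT_SCHED,
	SKETCH_DEP_BIT_CLOCK_UNSTABLE,
	SKETCH_DEP_BIT_COUNT
};

/* One shared mask; file-scope storage starts out as zero (no dependency). */
static atomic_uint sketch_dep_mask;

/* A subsystem declares that it needs the tick. */
static void sketch_dep_set(enum sketch_dep_bit bit)
{
	assert(bit < SKETCH_DEP_BIT_COUNT);
	atomic_fetch_or(&sketch_dep_mask, 1u << bit);
}

/* A subsystem declares that it no longer needs the tick. */
static void sketch_dep_clear(enum sketch_dep_bit bit)
{
	assert(bit < SKETCH_DEP_BIT_COUNT);
	atomic_fetch_and(&sketch_dep_mask, ~(1u << bit));
}

/* The all-in-one check: the tick may stop only if no bit is set. */
static bool sketch_can_stop_tick(void)
{
	return atomic_load(&sketch_dep_mask) == 0;
}
```

      In this sketch, an unstable sched clock would call
      sketch_dep_set(SKETCH_DEP_BIT_CLOCK_UNSTABLE) once, instead of being
      polled separately each time the tick is about to be stopped.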
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      4f49b90a
    • F
      posix-cpu-timers: Migrate to use new tick dependency mask model · b7878300
      Frederic Weisbecker committed
      Instead of providing asynchronous checks for the nohz subsystem to verify
      posix cpu timers tick dependency, migrate the latter to the new mask.
      
      In order to keep track of the running timers and expose the tick
      dependency accordingly, we must probe timer queuing and dequeuing
      on thread and process lists.
      
      Unfortunately it implies both task and signal level dependencies. We
      should be able to further optimize this and merge all that on the task
      level dependency, at the cost of a bit of complexity and maybe overhead.
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      b7878300
    • F
      sched: Migrate sched to use new tick dependency mask model · 76d92ac3
      Frederic Weisbecker committed
      Instead of providing asynchronous checks for the nohz subsystem to verify
      sched tick dependency, migrate sched to the new mask.
      
      Every time a task is enqueued or dequeued, we evaluate the state of the
      tick dependency on top of the policy of the tasks in the runqueue, by
      order of priority:
      
      SCHED_DEADLINE: Need the tick in order to periodically check for runtime
      SCHED_FIFO    : Don't need the tick (no round-robin)
      SCHED_RR      : Need the tick if more than 1 task of the same priority
                      for round robin (simplified with checking if more than
                      one SCHED_RR task no matter what priority).
      SCHED_NORMAL  : Need the tick if more than 1 task for round-robin.
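
      The ordering in the list above can be sketched as a single function
      over per-runqueue task counts. This is a simplified user-space C
      illustration; the struct sketch_rq_counts and the function name are
      assumptions made for this example and do not reproduce the kernel's
      runqueue layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-runqueue counters; not the kernel's struct rq. */
struct sketch_rq_counts {
	unsigned int dl_nr_running;  /* SCHED_DEADLINE tasks */
	unsigned int rt_nr_running;  /* all realtime tasks (FIFO + RR) */
	unsigned int rr_nr_running;  /* SCHED_RR tasks only */
	unsigned int cfs_nr_running; /* SCHED_NORMAL tasks */
};

/* Return true if, by the rules listed above, the tick is not needed. */
static bool sketch_sched_can_stop_tick(const struct sketch_rq_counts *rq)
{
	unsigned int fifo_nr_running;

	assert(rq->rr_nr_running <= rq->rt_nr_running);

	/* SCHED_DEADLINE: runtime must be checked periodically. */
	if (rq->dl_nr_running)
		return false;

	/* SCHED_FIFO: no round-robin, so no tick is needed. */
	fifo_nr_running = rq->rt_nr_running - rq->rr_nr_running;
	if (fifo_nr_running)
		return true;

	/* SCHED_RR: simplified to "more than one RR task", whatever the priority. */
	if (rq->rr_nr_running)
		return rq->rr_nr_running == 1;

	/* SCHED_NORMAL: more than one task needs periodic preemption checks. */
	return rq->cfs_nr_running <= 1;
}
```

      Because the checks run in priority order, a runqueue holding one
      SCHED_FIFO task and several SCHED_NORMAL tasks still reports that the
      tick can stop in this sketch.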
      
      We could optimize that further with one flag per sched policy on the tick
      dependency mask and perform only the checks relevant to the policy
      concerned by an enqueue/dequeue operation.
      
      Since the checks aren't based on the current task anymore, we could get
      rid of the task switch hook but it's still needed for posix cpu
      timers.
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      76d92ac3
    • F
      sched: Account rr tasks · 01d36d0a
      Frederic Weisbecker committed
      In order to evaluate the scheduler tick dependency without probing
      context switches, we need to know how many SCHED_RR and SCHED_FIFO tasks
      are enqueued, as those policies don't have the same preemption
      requirements.
      
      To prepare for that, let's account SCHED_RR tasks. We'll then be able
      to deduce the number of SCHED_FIFO tasks from it and the total number
      of RT tasks in the runqueue.
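
      The accounting described above can be sketched as two counters that
      are updated on enqueue and dequeue, with the SCHED_FIFO count derived
      by subtraction. This is a user-space C illustration; the type and
      function names (sketch_rt_rq, sketch_rt_enqueue, and so on) are
      hypothetical and not taken from the kernel.

```c
#include <assert.h>

/* Hypothetical realtime policies, for this example only. */
enum sketch_rt_policy { SKETCH_POLICY_FIFO, SKETCH_POLICY_RR };

/* Hypothetical realtime runqueue counters. */
struct sketch_rt_rq {
	unsigned int rt_nr_running; /* all RT tasks: FIFO + RR */
	unsigned int rr_nr_running; /* SCHED_RR tasks only */
};

/* Account a task on enqueue; RR tasks are counted in both fields. */
static void sketch_rt_enqueue(struct sketch_rt_rq *rt, enum sketch_rt_policy p)
{
	rt->rt_nr_running++;
	if (p == SKETCH_POLICY_RR)
		rt->rr_nr_running++;
}

/* Undo the accounting on dequeue. */
static void sketch_rt_dequeue(struct sketch_rt_rq *rt, enum sketch_rt_policy p)
{
	assert(rt->rt_nr_running > 0);
	rt->rt_nr_running--;
	if (p == SKETCH_POLICY_RR) {
		assert(rt->rr_nr_running > 0);
		rt->rr_nr_running--;
	}
}

/* SCHED_FIFO tasks are whatever RT tasks remain after removing RR ones. */
static unsigned int sketch_fifo_nr_running(const struct sketch_rt_rq *rt)
{
	assert(rt->rr_nr_running <= rt->rt_nr_running);
	return rt->rt_nr_running - rt->rr_nr_running;
}
```

      Keeping only the RR counter avoids a third field: the FIFO count is
      always available as the difference of the other two.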
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      01d36d0a
    • F
      perf: Migrate perf to use new tick dependency mask model · 555e0c1e
      Frederic Weisbecker committed
      Instead of providing asynchronous checks for the nohz subsystem to verify
      perf event tick dependency, migrate perf to the new mask.
      
      Perf needs the tick for two situations:
      
      1) Freq events. We could set the tick dependency when those are
      installed on a CPU context. But setting a global dependency on top of
      the global freq events accounting is much easier. If people want that
      to be optimized, we can still refine that on the per-CPU tick dependency
      level. This patch doesn't change the current behaviour anyway.
      
      2) Throttled events: this is a per-cpu dependency.
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      555e0c1e