1. 07 6月, 2011 1 次提交
    • A
      x86-64: Emulate legacy vsyscalls · 5cec93c2
      Andy Lutomirski 提交于
      There's a fair amount of code in the vsyscall page.  It contains
      a syscall instruction (in the gettimeofday fallback) and who
      knows what will happen if an exploit jumps into the middle of
      some other code.
      
      Reduce the risk by replacing the vsyscalls with short magic
      incantations that cause the kernel to emulate the real
      vsyscalls. These incantations are useless if entered in the
      middle.
      
      This causes vsyscalls to be a little more expensive than real
      syscalls.  Fortunately sensible programs don't use them.
      The only exception is time() which is still called by glibc
      through the vsyscall - but calling time() millions of times
      per second is not sensible. glibc has this fixed in the
      development tree.
      
      This patch is not perfect: the vread_tsc and vread_hpet
      functions are still at a fixed address.  Fixing that might
      involve making alternative patching work in the vDSO.
      Signed-off-by: NAndy Lutomirski <luto@mit.edu>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Jesper Juhl <jj@chaosbits.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
      Cc: Mikael Pettersson <mikpe@it.uu.se>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
      Cc: Valdis.Kletnieks@vt.edu
      Cc: pageexec@freemail.hu
      Link: http://lkml.kernel.org/r/e64e1b3c64858820d12c48fa739efbd1485e79d5.1307292171.git.luto@mit.edu
      [ Removed the CONFIG option - it's simpler to just do it unconditionally. Tidied up the code as well. ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5cec93c2
  2. 03 3月, 2011 1 次提交
  3. 14 2月, 2011 2 次提交
    • S
      x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32 · 70e4a369
      Shaohua Li 提交于
      Make the maxium TLB invalidate vectors depend on NR_CPUS linearly,
      with a maximum of 32 vectors.
      
      We currently only have 8 vectors for TLB invalidation and that is clearly
      inadequate. If we have a lot of CPUs, the CPUs need share the 8 vectors and
      tlbstate_lock is used to protect them. flush_tlb_page() is
      heavily used in page reclaim, which will cause a lot of lock
      contention for tlbstate_lock.
      
      Andi Kleen suggested increasing the vectors number to 32, which should be
      good for current typical systems to reduce the tlbstate_lock contention.
      
      My test system has 4 sockets and 64G memory, and 64 CPUs. My
      workload creates 64 processes. Each process mmap reads a big
      empty sparse file. The total size of the files are 2*total_mem,
      so this will cause a lot of page reclaim.
      
      Below is the result I get from perf call-graph profiling:
      
       without the patch:
       ------------------
      
          24.25%           usemem  [kernel]                                   [k] _raw_spin_lock
                           |
                           --- _raw_spin_lock
                              |
                              |--42.15%-- native_flush_tlb_others
      
       with the patch:
       ------------------
      
          14.96%           usemem  [kernel]                                   [k] _raw_spin_lock
                           |
                           --- _raw_spin_lock
                              |--13.89%-- native_flush_tlb_others
      
      So this heavily reduces the tlbstate_lock contention.
      Suggested-by: NAndi Kleen <andi@firstfloor.org>
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1295232727.1949.709.camel@sli10-conroe>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      70e4a369
    • S
      x86: Cleanup vector usage · 60f6e65d
      Shaohua Li 提交于
      Cleanup the vector usage and make them continuous if possible.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      LKML-Reference: <1295232722.1949.707.camel@sli10-conroe>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      60f6e65d
  4. 19 10月, 2010 1 次提交
    • P
      irq_work: Add generic hardirq context callbacks · e360adbe
      Peter Zijlstra 提交于
      Provide a mechanism that allows running code in IRQ context. It is
      most useful for NMI code that needs to interact with the rest of the
      system -- like wakeup a task to drain buffers.
      
      Perf currently has such a mechanism, so extract that and provide it as
      a generic feature, independent of perf so that others may also
      benefit.
      
      The IRQ context callback is generated through self-IPIs where
      possible, or on architectures like powerpc the decrementer (the
      built-in timer facility) is set to generate an interrupt immediately.
      
      Architectures that don't have anything like this get to do with a
      callback from the timer tick. These architectures can call
      irq_work_run() at the tail of any IRQ handlers that might enqueue such
      work (like the perf IRQ handler) to avoid undue latencies in
      processing the work.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NKyle McMartin <kyle@mcmartin.ca>
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      [ various fixes ]
      Signed-off-by: NHuang Ying <ying.huang@intel.com>
      LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e360adbe
  5. 23 7月, 2010 1 次提交
  6. 19 1月, 2010 1 次提交
    • S
      x86, irq: Use 0x20 for the IRQ_MOVE_CLEANUP_VECTOR instead of 0x1f · 6579b474
      Suresh Siddha 提交于
      After talking to some more folks inside intel (Peter Anvin, Asit Mallick),
      the safest option (for future compatibility etc) seen was to use vector 0x20
      for IRQ_MOVE_CLEANUP_VECTOR instead of using vector 0x1f (which is documented as
      reserved vector in the Intel IA32 manuals).
      
      Also we don't need to reserve the entire privilege level (all 16 vectors in
      the priority bucket that IRQ_MOVE_CLEANUP_VECTOR falls into), as the
      x86 architecture (section 10.9.3 in SDM Vol3a) specifies that with in the
      priority level, the higher the vector number the higher the priority.
      And hence we don't need to reserve the complete priority level 0x20-0x2f for
      the IRQ migration cleanup logic.
      
      So change the IRQ_MOVE_CLEANUP_VECTOR to 0x20 and  allow 0x21-0x2f to be used
      for device interrupts. 0x30-0x3f will be used for ISA interrupts (these
      also can be migrated in the context of IOAPIC and hence need to be at a higher
      priority level than IRQ_MOVE_CLEANUP_VECTOR).
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <20100114002118.521826763@sbs-t61.sc.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      6579b474
  7. 05 1月, 2010 2 次提交
    • H
      x86, apic: Don't waste a vector to improve vector spread · ea943966
      H. Peter Anvin 提交于
      We want to use a vector-assignment sequence that avoids stumbling onto
      0x80 earlier in the sequence, in order to improve the spread of
      vectors across priority levels on machines with a small number of
      interrupt sources.  Right now, this is done by simply making the first
      vector (0x31 or 0x41) completely unusable.  This is unnecessary; all
      we need is to start assignment at a +1 offset, we don't actually need
      to prohibit the usage of this vector once we have wrapped around.
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      LKML-Reference: <4B426550.6000209@kernel.org>
      ea943966
    • H
      x86, apic: Reclaim IDT vectors 0x20-0x2f · 99d113b1
      H. Peter Anvin 提交于
      Reclaim 16 IDT vectors and make them available for general allocation.
      
      Reclaim vectors 0x20-0x2f by reallocating the IRQ_MOVE_CLEANUP_VECTOR
      to vector 0x1f.  This is in the range of vector numbers that is
      officially reserved for the CPU (for exceptions), however, the use of
      the APIC to generate any vector 0x10 or above is documented, and the
      CPU internally can receive any vector number (the legacy BIOS uses INT
      0x08-0x0f for interrupts, as messed up as that is.)
      
      Since IRQ_MOVE_CLEANUP_VECTOR has to be alone in the lowest-numbered
      priority level (block of 16), this effectively enables us to reclaim
      an otherwise-unusable APIC priority level and put it to use.
      
      Since this is a transient kernel-only allocation we can change it at
      any time, and if/when there is an exception at vector 0x1f this
      assignment needs to be changed as part of OS enabling that new feature.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B4284C6.9030107@kernel.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      99d113b1
  8. 30 12月, 2009 1 次提交
    • Y
      x86: Increase NR_IRQS and nr_irqs · 9959c888
      Yinghai Lu 提交于
      I have a system with lots of igb and ixgbe, when iov/vf are
      enabled for them, we hit the limit of 3064.
      
      when system has 20 pcie installed, and one card has 2
      functions, and one function needs 64 msi-x,
       may need 20 * 2 * 64 = 2560 for msi-x
      
      but if iov and vf are enabled
       may need 20 * 2 * 64 * 3 = 7680 for msi-x
      assume system with 5 ioapic, nr_irqs_gsi will be 120.
      
      NR_CPUS = 512, and nr_cpu_ids = 128
      will have NR_IRQS = 256 + 512 * 64 = 33024
      will have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824
      
      When SPARSE_IRQ is not set, there is no increase with kernel data
      size.
      
      when NR_CPUS=128, and SPARSE_IRQ is set:
         text		   data	    bss		   dec		 hex	filename
      21837444	4216564	12480736	38534744	24bfe58	vmlinux.before
      21837442	4216580	12480736	38534758	24bfe66	vmlinux.after
      when NR_CPUS=4096, and SPARSE_IRQ is set
         text		   data	    bss		   dec		 hex	filename
      21878619	5610244	13415392	40904255	270263f	vmlinux.before
      21878617	5610244	13415392	40904253	270263d	vmlinux.after
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B398ECD.1080506@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9959c888
  9. 13 12月, 2009 1 次提交
  10. 15 10月, 2009 1 次提交
  11. 04 6月, 2009 3 次提交
    • A
      x86, mce: define MCE_VECTOR · 8fa8dd9e
      Andi Kleen 提交于
      Add MCE_VECTOR for the #MC exception.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      8fa8dd9e
    • A
      x86: fix panic with interrupts off (needed for MCE) · 4ef702c1
      Andi Kleen 提交于
      For some time each panic() called with interrupts disabled
      triggered the !irqs_disabled() WARN_ON in smp_call_function(),
      producing ugly backtraces and confusing users.
      
      This is a common situation with machine checks for example which
      tend to call panic with interrupts disabled, but will also hit
      in other situations e.g. panic during early boot.  In fact it
      means that panic cannot be called in many circumstances, which
      would be bad.
      
      This all started with the new fancy queued smp_call_function,
      which is then used by the shutdown path to shut down the other
      CPUs.
      
      On closer examination it turned out that the fancy RCU
      smp_call_function() does lots of things not suitable in a panic
      situation anyways, like allocating memory and relying on complex
      system state.
      
      I originally tried to patch this over by checking for panic
      there, but it was quite complicated and the original patch
      was also not very popular.  This also didn't fix some of the
      underlying complexity problems.
      
      The new code in post 2.6.29 tries to patch around this by
      checking for oops_in_progress, but that is not enough to make
      this fully safe and I don't think that's a real solution
      because panic has to be reliable.
      
      So instead use an own vector to reboot.  This makes the reboot
      code extremly straight forward, which is definitely a big plus
      in a panic situation where it is important to avoid relying on
      too much kernel state.  The new simple code is also safe to be
      called from interupts off region because it is very very simple.
      
      There can be situations where it is important that panic
      is reliable.  For example on a fatal machine check the panic
      is needed to get the system up again and running as quickly
      as possible.  So it's important that panic is reliable and
      all function it calls simple.
      
      This is why I came up with this simple vector scheme.
      It's very hard to beat in simplicity.  Vectors are not
      particularly precious anymore since all big systems are
      using per CPU vectors.
      
      Another possibility would have been to use an NMI similar
      to kdump, but there is still the problem that NMIs don't
      work reliably on some systems due to BIOS issues.  NMIs
      would have been able to stop CPUs running with interrupts
      off too.  In the sake of universal reliability I opted for
      using a non NMI vector for now.
      
      I put the reboot vector into the highest priority bucket of
      the APIC vectors and moved the 64bit UV_BAU message down
      instead into the next lower priority.
      
      [ Impact: bug fix, fixes an old regression ]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      4ef702c1
    • A
      x86, mce: implement bootstrapping for machine check wakeups · ccc3c319
      Andi Kleen 提交于
      Machine checks support waking up the mcelog daemon quickly.
      
      The original wake up code for this was pretty ugly, relying on
      a idle notifier and a special process flag. The reason it did
      it this way is that the machine check handler is not subject
      to normal interrupt locking rules so it's not safe
      to call wake_up().  Instead it set a process flag
      and then either did the wakeup in the syscall return
      or in the idle notifier.
      
      This patch adds a new "bootstraping" method as replacement.
      
      The idea is that the handler checks if it's in a state where
      it is unsafe to call wake_up(). If it's safe it calls it directly.
      When it's not safe -- that is it interrupted in a critical
      section with interrupts disables -- it uses a new "self IPI" to trigger
      an IPI to its own CPU. This can be done safely because IPI
      triggers are atomic with some care. The IPI is raised
      once the interrupts are reenabled and can then safely call
      wake_up().
      
      When APICs are disabled the event is just queued and will be picked up
      eventually by the next polling timer. I think that's a reasonable
      compromise, since it should only happen quite rarely.
      
      Contains fixes from Ying Huang.
      
      [ solve conflict on irqinit, make it work on 32bit (entry_arch.h) - HS ]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      ccc3c319
  12. 03 6月, 2009 1 次提交
    • Y
      perf_counter/x86: Remove the IRQ (non-NMI) handling bits · a3288106
      Yong Wang 提交于
      Remove the IRQ (non-NMI) handling bits as NMI will be used always.
      Signed-off-by: NYong Wang <yong.y.wang@intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      LKML-Reference: <20090603051255.GA2791@ywang-moblin2.bj.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a3288106
  13. 29 5月, 2009 2 次提交
  14. 10 4月, 2009 1 次提交
  15. 07 4月, 2009 1 次提交
  16. 05 3月, 2009 1 次提交
  17. 25 2月, 2009 1 次提交
  18. 16 2月, 2009 1 次提交
  19. 31 1月, 2009 9 次提交
  20. 21 1月, 2009 2 次提交
    • T
      x86: make x86_32 use tlb_64.c · 02cf94c3
      Tejun Heo 提交于
      Impact: less contention when issuing invalidate IPI, cleanup
      
      Make x86_32 use the same tlb code as 64bit.  The 64bit code uses
      multiple IPI vectors for tlb shootdown to reduce contention.  This
      patch makes x86_32 allocate the same 8 IPIs as x86_64 and share the
      code paths.
      
      Note that the usage of asmlinkage is inconsistent for x86_32 and 64
      and calls for further cleanup.  This has been noted with a FIXME
      comment in tlb_64.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      02cf94c3
    • T
      x86: prepare for tlb merge · 6dd01bed
      Tejun Heo 提交于
      Impact: clean up, ipi vector number reordering for x86_32
      
      Make the following changes to prepare for tlb merge.
      
      * reorder x86_32 ip vectors
      
      * adjust tlb_32.c and tlb_64.c such that their logics coincide exactly
      	- on spurious invalidate ipi, tlb_32 acks the irq
      	- tlb_64 now has proper memory barriers around clearing
                flush_cpumask (no change in generated code)
      
      * unexport flush_tlb_page from tlb_32.c, there's no user
      
      * use unsigned int for cpu id
      
      * drop unnecessary includes from tlb_64.c
      Signed-off-by: NTejun Heo <tj@kernel.org>
      6dd01bed
  21. 13 1月, 2009 1 次提交
  22. 12 1月, 2009 1 次提交
    • M
      irq: initialize nr_irqs based on nr_cpu_ids · 9332fccd
      Mike Travis 提交于
      Impact: Reduce memory usage.
      
      This is the second half of the changes to make the irq_desc_ptrs be
      variable sized based on nr_cpu_ids.  This is done by adding a new
      "max_nr_irqs" macro to irq_vectors.h (and a dummy in irqnr.h) to
      return a max NR_IRQS value based on NR_CPUS or nr_cpu_ids.
      
      This necessitated moving the define of MAX_IO_APICS to a separate
      file (asm/apicnum.h) so it could be included without the baggage
      of the other asm/apicdef.h declarations.
      Signed-off-by: NMike Travis <travis@sgi.com>
      9332fccd
  23. 08 12月, 2008 3 次提交
    • I
      performance counters: x86 support · 241771ef
      Ingo Molnar 提交于
      Implement performance counters for x86 Intel CPUs.
      
      It's simplified right now: the PERFMON CPU feature is assumed,
      which is available in Core2 and later Intel CPUs.
      
      The design is flexible to be extended to more CPU types as well.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      241771ef
    • Y
      x86: use NR_IRQS_LEGACY · 99d093d1
      Yinghai Lu 提交于
      Impact: cleanup
      
      Introduce NR_IRQS_LEGACY instead of hard coded number.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      99d093d1
    • Y
      sparse irq_desc[] array: core kernel and x86 changes · 0b8f1efa
      Yinghai Lu 提交于
      Impact: new feature
      
      Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
      NR_CPUS set to large values. The goal is to be able to scale up to much
      larger NR_IRQS value without impacting the (important) common case.
      
      To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
      irq_desc pointers.
      
      When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
      this also makes the IRQ descriptors NUMA-local (to the site that calls
      request_irq()).
      
      This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
      uses desc->chip_data for x86 to store irq_cfg.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0b8f1efa
  24. 06 11月, 2008 1 次提交