1. 23 September 2008, 2 commits
    • x86: prevent C-states hang on AMD C1E enabled machines · a8d68290
      Committed by Thomas Gleixner
      Impact: System hang when AMD C1E machines switch into C2/C3
      
      AMD C1E enabled systems do not work with normal ACPI C-states 
      even if the BIOS is advertising them. Limit the C-states to 
      C1 for the ACPI processor idle code.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86: prevent stale state of c1e_mask across CPU offline/online · 4faac97d
      Committed by Thomas Gleixner
      Impact: hang which happens across CPU offline/online on AMD C1E systems.
      
      When a CPU goes offline, the corresponding bit in the broadcast
      mask is cleared. For AMD C1E enabled CPUs we do not re-enable the
      broadcast when the CPU comes online again, because we never clear the
      corresponding bit in c1e_mask, which keeps track of which CPUs
      have already been switched to broadcast. So on those machines we
      never switch back to broadcasting after a CPU offline/online cycle.
      
      Clear the bit when the CPU plays dead.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
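
      The bookkeeping described above can be modelled in a few lines of plain C.
      This is a simplified user-space sketch, not the kernel code; the single
      64-bit mask and the function names are illustrative:

      #include <stdint.h>
      #include <stdio.h>

      /* One bit per CPU, set once a CPU has been switched to the broadcast
       * timer because of C1E. */
      static uint64_t c1e_mask;

      static void cpu_enter_idle(int cpu)
      {
              if (!(c1e_mask & (1ULL << cpu))) {
                      c1e_mask |= 1ULL << cpu;
                      printf("cpu%d: switching to broadcast\n", cpu);
              }
      }

      /* The fix: clear the bit when the CPU plays dead, so the next
       * online/idle cycle switches it to broadcast again. */
      static void cpu_play_dead(int cpu)
      {
              c1e_mask &= ~(1ULL << cpu);
      }

      int main(void)
      {
              cpu_enter_idle(1);      /* first idle: switch to broadcast */
              cpu_play_dead(1);       /* offline; without the clear, the next call would be a no-op */
              cpu_enter_idle(1);      /* online again: switches to broadcast once more */
              return 0;
      }
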
  2. 6 September 2008, 1 commit
  3. 26 August 2008, 2 commits
  4. 25 August 2008, 2 commits
  5. 24 August 2008, 1 commit
  6. 23 August 2008, 1 commit
    • x86 MCE: Fix CPU hotplug problem with multiple multicore AMD CPUs · 8735728e
      Committed by Rafael J. Wysocki
      During CPU hot-remove the sysfs directory created by
      threshold_create_bank(), defined in
      arch/x86/kernel/cpu/mcheck/mce_amd_64.c, has to be removed before
      its parent directory, created by mce_create_device(), defined in
      arch/x86/kernel/cpu/mcheck/mce_64.c.  Moreover, when the CPU in
      question is hotplugged again, obviously the latter has to be created
      before the former.  At present, the right ordering is not enforced,
      because all of these operations are carried out by CPU hotplug
      notifiers which are not appropriately ordered with respect to each
      other.  This leads to serious problems on systems with two or more
      multicore AMD CPUs, among other things during suspend and hibernation.
      
      Fix the problem by placing threshold bank CPU hotplug callbacks in
      mce_cpu_callback(), so that they are invoked at the right places,
      if defined.  Additionally, use kobject_del() to remove the sysfs
      directory associated with the kobject created by
      kobject_create_and_add() in threshold_create_bank(), to prevent the
      kernel from crashing during CPU hotplug operations on systems with
      two or more multicore AMD CPUs.
      
      This patch fixes bug #11337.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Andi Kleen <andi@firstfloor.org>
      Tested-by: Mark Langsdorf <mark.langsdorf@amd.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
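
      A schematic sketch of the ordering the fix enforces. The callback names
      and signatures below are simplified and illustrative rather than the
      exact upstream ones:

      /* Parent sysfs device first on online, per-bank directories first on
       * removal, all driven from one notifier so the nesting is always right. */
      static int __cpuinit mce_cpu_callback(struct notifier_block *nfb,
                                            unsigned long action, void *hcpu)
      {
              unsigned int cpu = (unsigned long)hcpu;

              switch (action) {
              case CPU_ONLINE:
                      mce_create_device(cpu);                      /* parent ... */
                      if (threshold_cpu_callback)                  /* "if defined" */
                              threshold_cpu_callback(action, cpu); /* ... then the banks */
                      break;
              case CPU_DEAD:
                      if (threshold_cpu_callback)
                              threshold_cpu_callback(action, cpu); /* banks first ... */
                      mce_remove_device(cpu);                      /* ... then the parent */
                      break;
              }
              return NOTIFY_OK;
      }
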
  7. 22 August 2008, 1 commit
    • x86: fix section mismatch warning - uv_cpu_init · c4bd1fda
      Committed by Marcin Slusarz
      WARNING: vmlinux.o(.cpuinit.text+0x3cc4): Section mismatch in reference from the function uv_cpu_init() to the function .init.text:uv_system_init()
      The function __cpuinit uv_cpu_init() references
      a function __init uv_system_init().
      If uv_system_init is only used by uv_cpu_init then
      annotate uv_system_init with a matching annotation.
      
      uv_system_init was meant to be called only once, so call it from a code path
      (native_smp_prepare_cpus) which runs once, right before the other CPUs are
      activated (smp_init).
      
      Note: the old code relied on uv_node_to_blade being initialized to 0,
      but it is not initialized anywhere.
      Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
      Acked-by: Jack Steiner <steiner@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 20 August 2008, 1 commit
  9. 18 August 2008, 3 commits
    • x86, percpu: silence section mismatch warnings related to EARLY_PER_CPU variables · c6a92a25
      Committed by Marcin Slusarz
      Quoting Mike Travis in "x86: cleanup early per cpu variables/accesses v4"
      (23ca4bba):
      
          The DEFINE macro defines the per_cpu variable as well as the early
          map and pointer.  It also initializes the per_cpu variable and map
          elements to "_initvalue".  The early_* macros provide access to
          the initial map (usually setup during system init) and the early
          pointer.  This pointer is initialized to point to the early map
          but is then NULL'ed when the actual per_cpu areas are setup.  After
          that the per_cpu variable is the correct access to the variable.
      
      As these variables are NULL'ed before __init sections are dropped
      (in setup_per_cpu_maps), they can be safely annotated as __ref.
      
      This change silences following section mismatch warnings:
      
      WARNING: vmlinux.o(.data+0x46c0): Section mismatch in reference from the variable x86_cpu_to_apicid_early_ptr to the variable .init.data:x86_cpu_to_apicid_early_map
      The variable x86_cpu_to_apicid_early_ptr references
      the variable __initdata x86_cpu_to_apicid_early_map
      If the reference is valid then annotate the
      variable with __init* (see linux/init.h) or name the variable:
      *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
      
      WARNING: vmlinux.o(.data+0x46c8): Section mismatch in reference from the variable x86_bios_cpu_apicid_early_ptr to the variable .init.data:x86_bios_cpu_apicid_early_map
      The variable x86_bios_cpu_apicid_early_ptr references
      the variable __initdata x86_bios_cpu_apicid_early_map
      If the reference is valid then annotate the
      variable with __init* (see linux/init.h) or name the variable:
      *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
      
      WARNING: vmlinux.o(.data+0x46d0): Section mismatch in reference from the variable x86_cpu_to_node_map_early_ptr to the variable .init.data:x86_cpu_to_node_map_early_map
      The variable x86_cpu_to_node_map_early_ptr references
      the variable __initdata x86_cpu_to_node_map_early_map
      If the reference is valid then annotate the
      variable with __init* (see linux/init.h) or name the variable:
      *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
      Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: mmconf: fix section mismatch warning · c72a5efe
      Committed by Marcin Slusarz
      WARNING: arch/x86/kernel/built-in.o(.cpuinit.text+0x1591): Section mismatch in reference from the function init_amd() to the function .init.text:check_enable_amd_mmconf_dmi()
      The function __cpuinit init_amd() references
      a function __init check_enable_amd_mmconf_dmi().
      If check_enable_amd_mmconf_dmi is only used by init_amd then
      annotate check_enable_amd_mmconf_dmi with a matching annotation.
      
      check_enable_amd_mmconf_dmi is only called from init_amd, which is __cpuinit.
      Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: correct register constraints for 64-bit atomic operations · 3c3b5c3b
      Committed by Mathieu Desnoyers
      The x86_64 add/sub atomic ops do not accept integer values bigger than
      32 bits as immediates.  Intel's add/sub documentation specifies that such
      values have to be passed in registers.
      
      The only operation in the x86-64 architecture which accepts an arbitrary
      64-bit immediate is "movq" to a register; similarly, the only operation
      which accepts an arbitrary 64-bit displacement is "movabs" to or from
      al/ax/eax/rax.
      
      http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Machine-Constraints.html
      states:
      
      e
          32-bit signed integer constant, or a symbolic reference known to fit
          that range (for immediate operands in sign-extending x86-64
          instructions).
      Z
          32-bit unsigned integer constant, or a symbolic reference known to
          fit that range (for immediate operands in zero-extending x86-64
          instructions).
      
      Since add/sub sign-extend their immediates, using the "e" constraint seems appropriate.
      
      It applies to 2.6.27-rc, 2.6.26, 2.6.25...
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
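
      A self-contained illustration of the constraint in question (user-space C,
      built with GCC on x86-64; the function name is made up, but the "er"
      constraint pair is the one this change switches to):

      #include <stdint.h>
      #include <stdio.h>

      /* "er": a sign-extendable 32-bit immediate when the value fits,
       * otherwise a register, which is exactly what a 64-bit "addq"
       * can actually encode. */
      static inline void atomic64_add_sketch(int64_t i, volatile int64_t *v)
      {
              asm volatile("lock; addq %1,%0"
                           : "=m" (*v)
                           : "er" (i), "m" (*v));
      }

      int main(void)
      {
              volatile int64_t counter = 0;

              atomic64_add_sketch(1, &counter);          /* fits: immediate form */
              atomic64_add_sketch(1LL << 40, &counter);  /* too big: register form */
              printf("%lld\n", (long long)counter);
              return 0;
      }
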
  10. 16 August 2008, 2 commits
    • x86: add MAP_STACK mmap flag · cd98a04a
      Committed by Ingo Molnar
      as per this discussion:
      
         http://lkml.org/lkml/2008/8/12/423
      
      Pardo reported that 64-bit threaded apps, if their stacks exceed the
      combined size of ~4GB, slow down drastically in pthread_create() - because
      glibc uses MAP_32BIT to allocate the stacks. The use of MAP_32BIT is
      a legacy hack - to speed up context switching on certain early model
      64-bit P4 CPUs.
      
      So introduce a new flag to be used by glibc instead, to not constrain
      64-bit apps like this.
      
      glibc can switch to this new flag straight away - it will be ignored
      by the kernel. If those old CPUs ever matter to anyone, support for
      it can be implemented.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Ulrich Drepper <drepper@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
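
      A minimal user-space illustration of how glibc (or any caller) would use
      the new flag. The fallback #define is only for pre-2.6.27 headers; the
      value shown is the x86 one, and the kernel treats the flag as a hint it
      may ignore:

      #include <stdio.h>
      #include <sys/mman.h>

      #ifndef MAP_STACK
      #define MAP_STACK 0x20000   /* x86 value; older kernels simply ignore it */
      #endif

      int main(void)
      {
              size_t len = 8 * 1024 * 1024;   /* a thread-stack sized mapping */
              void *stack = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

              if (stack == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }
              printf("stack mapping at %p\n", stack);
              munmap(stack, len);
              return 0;
      }
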
    • x86: add MAP_STACK mmap flag · 2fdc8690
      Committed by Ingo Molnar
      as per this discussion:
      
         http://lkml.org/lkml/2008/8/12/423
      
      Pardo reported that 64-bit threaded apps, if their stacks exceed the
      combined size of ~4GB, slow down drastically in pthread_create() - because
      glibc uses MAP_32BIT to allocate the stacks. The use of MAP_32BIT is
      a legacy hack - to speed up context switching on certain early model
      64-bit P4 CPUs.
      
      So introduce a new flag to be used by glibc instead, to not constrain
      64-bit apps like this.
      
      glibc can switch to this new flag straight away - it will be ignored
      by the kernel. If those old CPUs ever matter to anyone, support for
      it can be implemented.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Ulrich Drepper <drepper@gmail.com>
  11. 15 August 2008, 10 commits
  12. 13 August 2008, 2 commits
    • crypto: padlock - fix VIA PadLock instruction usage with irq_ts_save/restore() · e4914012
      Committed by Suresh Siddha
      Wolfgang Walter reported this oops on his via C3 using padlock for
      AES-encryption:
      
      ##################################################################
      
      BUG: unable to handle kernel NULL pointer dereference at 000001f0
      IP: [<c01028c5>] __switch_to+0x30/0x117
      *pde = 00000000
      Oops: 0002 [#1] PREEMPT
      Modules linked in:
      
      Pid: 2071, comm: sleep Not tainted (2.6.26 #11)
      EIP: 0060:[<c01028c5>] EFLAGS: 00010002 CPU: 0
      EIP is at __switch_to+0x30/0x117
      EAX: 00000000 EBX: c0493300 ECX: dc48dd00 EDX: c0493300
      ESI: dc48dd00 EDI: c0493530 EBP: c04cff8c ESP: c04cff7c
       DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
      Process sleep (pid: 2071, ti=c04ce000 task=dc48dd00 task.ti=d2fe6000)
      Stack: dc48df30 c0493300 00000000 00000000 d2fe7f44 c03b5b43 c04cffc8 00000046
             c0131856 0000005a dc472d3c c0493300 c0493470 d983ae00 00002696 00000000
             c0239f54 00000000 c04c4000 c04cffd8 c01025fe c04f3740 00049800 c04cffe0
      Call Trace:
       [<c03b5b43>] ? schedule+0x285/0x2ff
       [<c0131856>] ? pm_qos_requirement+0x3c/0x53
       [<c0239f54>] ? acpi_processor_idle+0x0/0x434
       [<c01025fe>] ? cpu_idle+0x73/0x7f
       [<c03a4dcd>] ? rest_init+0x61/0x63
       =======================
      
      Wolfgang also found out that adding kernel_fpu_begin() and kernel_fpu_end()
      around the padlock instructions fixes the oops.
      
      Suresh wrote:
      
      Although these padlock instructions don't use/touch the SSE registers,
      they behave like other SSE instructions.  For example, they might cause
      DNA faults when cr0.ts is set.  While this is a spurious DNA trap, it
      might cause an oops with the recent fpu code changes.
      
      This is the code sequence  that is probably causing this problem:
      
      a) new app is getting exec'd and it is somewhere in between
         start_thread() and flush_old_exec() in the load_xyz_binary()
      
      b) At point "a", the task's fpu state (like TS_USEDFPU, used_math() etc.)
         is cleared.
      
      c) Now we get an interrupt/softirq which starts using these encrypt/decrypt
         routines in the network stack. This generates a math fault (as
         cr0.ts is '1') which sets TS_USEDFPU and restores the math that is
         in the task's xstate.
      
      d) Return to exec code path, which does start_thread() which does
         free_thread_xstate() and sets xstate pointer to NULL while
         the TS_USEDFPU is still set.
      
      e) At the next context switch from the newly exec'd task to another task,
         we have a scenario where TS_USEDFPU is set but the xstate pointer is
         NULL.  This can cause an oops during unlazy_fpu() in __switch_to().
      
      Now:
      
      1) This can happen with or without preemption.  Viro also encountered a
         similar problem without CONFIG_PREEMPT.
      
      2) kernel_fpu_begin() and kernel_fpu_end() will fix this problem, because
         kernel_fpu_begin() will manually do a clts() and won't run into the
         situation of setting TS_USEDFPU in step "c" above.
      
      3) This was working before the fpu changes, because it is a spurious
         math fault which doesn't corrupt any fpu/sse registers and the task's
         math state was always in an allocated state.
      
      Without the recent lazy fpu allocation changes we don't see the oops, but
      a possible race is still present in older kernels (for example, while the
      kernel is using kernel_fpu_begin() in some optimized clear/copy page
      routine and an interrupt/softirq happens which uses these padlock
      instructions, generating a DNA fault).
      
      This is the failing scenario that existed even before the lazy fpu allocation
      changes:
      
      0. CPU's TS flag is set
      
      1. The kernel is using the FPU in some optimized copy routine and, while
      doing kernel_fpu_begin(), takes an interrupt just before doing clts().
      
      2. The interrupt handler (ipsec) uses a padlock instruction, and we
      take a DNA fault as the TS flag is still set.
      
      3. We handle the DNA fault and set TS_USEDFPU and clear cr0.ts
      
      4. We complete the padlock routine
      
      5. Go back to step 1, which resumes the clts() in kernel_fpu_begin(),
      finishes the optimized copy routine and does kernel_fpu_end().  At this
      point, cr0.ts is again set to '1' but the task's TS_USEDFPU is still set
      and not cleared.
      
      6. Now the kernel resumes its user operation.  At the next context
      switch, the kernel sees it has to do an FP save as TS_USEDFPU is still set
      and then will do an unlazy_fpu() in __switch_to().  unlazy_fpu()
      will take a DNA fault, as cr0.ts is '1', and now, because we are
      in __switch_to(), math_state_restore() will get confused and will
      restore the next task's FP state and will save it in the prev task's FP
      state.  Remember, in __switch_to() we are already on the stack of the
      next task but take a DNA fault for the prev task.
      
      This causes the fpu leakage.
      
      Fix the padlock instruction usage by wrapping those instructions in the
      new routines irq_ts_save()/irq_ts_restore(), which clear and restore
      cr0.ts manually in interrupt context.  This avoids generating a spurious
      DNA fault in the context of the interrupt, which fixes the oops
      encountered and the possible FPU leakage issue.
      Reported-and-bisected-by: Wolfgang Walter <wolfgang.walter@stwm.de>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
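
      The resulting pattern in the padlock driver looks roughly like the
      kernel-style sketch below. irq_ts_save()/irq_ts_restore() are the helpers
      this patch introduces, while padlock_xcrypt_sketch() is a hypothetical
      stand-in for the actual REP XCRYPT wrappers:

      #include <linux/types.h>
      #include <asm/i387.h>

      /* Hypothetical wrapper around the REP XCRYPT instruction sequence. */
      extern void padlock_xcrypt_sketch(const u8 *in, u8 *out,
                                        void *key, void *control_word);

      static void padlock_crypt_one_block(const u8 *in, u8 *out,
                                          void *key, void *control_word)
      {
              int ts_state;

              ts_state = irq_ts_save();       /* clts() if TS was set, remember it */
              padlock_xcrypt_sketch(in, out, key, control_word);
              irq_ts_restore(ts_state);       /* put CR0.TS back as it was */
      }
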
    • x86: propagate new nonpanic bootmem macros to CONFIG_HAVE_ARCH_BOOTMEM_NODE · 0ed89b06
      Committed by Johannes Weiner
      Commit 74768ed8 "page allocator: use no-panic variant of
      alloc_bootmem() in alloc_large_system_hash()" introduced two new
      _nopanic macros which are undefined for CONFIG_HAVE_ARCH_BOOTMEM_NODE.
      Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
      Acked-by: "Jan Beulich" <jbeulich@novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  13. 12 August 2008, 1 commit
  14. 11 August 2008, 2 commits
    • x86_64: restore the proper NR_IRQS define so larger systems work. · 3c7569b2
      Committed by Eric W. Biederman
      As pointed out and tracked by Yinghai Lu <yhlu.kernel@gmail.com>:
      
       Dhaval Giani got:
       kernel BUG at arch/x86/kernel/io_apic_64.c:357!
       invalid opcode: 0000 [1] SMP
       CPU 24
       ...
      
      His system (x3950) has 8 ioapics, so irq numbers exceed 256.
      
      This was caused by:
      
             commit 9b7dc567
             Author: Thomas Gleixner <tglx@linutronix.de>
             Date:   Fri May 2 20:10:09 2008 +0200
      
                x86: unify interrupt vector defines
      
                The interrupt vector defines are copied 4 times around with minimal
                differences. Move them all into asm-x86/irq_vectors.h
      
      It appears that Thomas did not notice that x86_64 does something
      completely different when he merged irq_vectors.h.
      
      We can solve this for 2.6.27 by simply reintroducing the old heuristic
      for setting NR_IRQS on x86_64 to a usable value, which trivially removes
      the regression.
      
      Long term it would be nice to harmonize the handling of ioapic interrupts
      of x86_32 and x86_64 so we don't have this kind of confusion.
      
      Dhaval Giani <dhaval@linux.vnet.ibm.com> tested an earlier version of
      this patch by YH which confirms simply increasing NR_IRQS fixes the
      problem.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: Restore proper vector locking during cpu hotplug · d388e5fd
      Committed by Eric W. Biederman
      Having cpu_online_map change during assign_irq_vector can result
      in some really nasty and weird things happening.  The one that
      bit me last time was accessing non-existent per-cpu memory for
      non-existent cpus.
      
      This locking was removed in a sloppy x86_64 and x86_32 merge patch.
      
      Guys can we please try and avoid subtly breaking x86 when we are
      merging files together?
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
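
      A sketch of the restored locking, simplified from the smpboot/io_apic
      code of that era (the lock/unlock helper names are as used in the x86
      tree; the surrounding function is illustrative):

      /* The onlining CPU publishes itself in cpu_online_map only while holding
       * the vector lock, so assign_irq_vector() never races with the change. */
      static void announce_cpu_online(void)
      {
              lock_vector_lock();
              cpu_set(smp_processor_id(), cpu_online_map);
              unlock_vector_lock();
      }
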
  15. 29 July 2008, 2 commits
  16. 27 July 2008, 7 commits
    • KVM: SVM: allow enabling/disabling NPT by reloading only the architecture module · 5f4cb662
      Committed by Joerg Roedel
      If NPT is enabled after loading both KVM modules on AMD and it then needs
      to be disabled, both KVM modules must be reloaded.  If only the
      architecture module is reloaded, the behavior is undefined.  With this
      patch it is possible to disable NPT by reloading only the kvm_amd module.
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
      Signed-off-by: Avi Kivity <avi@qumranet.com>
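
      The knob ends up as a kvm-amd module parameter (sketch below; the
      parameter name matches the svm code, the description string is
      illustrative), so toggling NPT only needs that one module to be
      reloaded, e.g. "modprobe -r kvm-amd; modprobe kvm-amd npt=0":

      static int npt = 1;
      module_param(npt, int, S_IRUGO);    /* visible read-only in sysfs */
      MODULE_PARM_DESC(npt, "Enable nested page tables (NPT) when available");
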
    • [PATCH] kill altroot · 7f2da1e7
      Committed by Al Viro
      long overdue...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • x86: lockless get_user_pages_fast() · 8174c430
      Committed by Nick Piggin
      Implement get_user_pages_fast without locking in the fastpath on x86.
      
      Do an optimistic lockless pagetable walk, without taking mmap_sem or any
      page table locks.  Page table existence is guaranteed by turning
      interrupts off (combined with the fact that we're always looking up the
      current mm, this means we can do the lockless page table walk within the
      constraints of the TLB shootdown design).  Basically we can do this
      lockless pagetable walk in a similar manner to the way the CPU's pagetable
      walker does not have to take any locks to find present ptes.
      
      This patch (combined with the subsequent ones to convert direct IO to use
      it) was found to give about 10% performance improvement on a 2 socket 8
      core Intel Xeon system running an OLTP workload on DB2 v9.5
      
       "To test the effects of the patch, an OLTP workload was run on an IBM
        x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
        2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel.  Comparing
        runs with and without the patch resulted in an overall performance
        benefit of ~9.8%.  Correspondingly, oprofiles showed that samples from
        __up_read and __down_read routines that is seen during thread contention
        for system resources was reduced from 2.8% down to .05%.  Monitoring the
        /proc/vmstat output from the patched run showed that the counter for
        fast_gup contained a very high number while the fast_gup_slow value was
        zero."
      
      (fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a
      counter we had for the number of times the slowpath was invoked).
      
      The main reason for the improvement is that DB2 has multiple threads each
      issuing direct-IO.  Direct-IO uses get_user_pages, and thus the threads
      contend the mmap_sem cacheline, and can also contend on page table locks.
      
      I would anticipate larger performance gains on larger systems, however I
      think DB2 uses an adaptive mix of threads and processes, so it could be
      that thread contention remains pretty constant as machine size increases.
      In which case, we're stuck with "only" a 10% gain.
      
      The downside of using get_user_pages_fast is that if there is not a pte
      with the correct permissions for the access, we end up falling back to
      get_user_pages, and so get_user_pages_fast is a bit of extra work.
      However this should not be the common case in most performance critical
      code.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: Kconfig fix]
      [akpm@linux-foundation.org: Makefile fix/cleanup]
      [akpm@linux-foundation.org: warning fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
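
      A sketch of how a direct-IO style caller would use the interface (the
      helper below is illustrative; the get_user_pages_fast() signature is the
      one used as of this series):

      #include <linux/mm.h>

      static int pin_user_buffer(unsigned long uaddr, int nr_pages,
                                 struct page **pages)
      {
              /* Fast path: no mmap_sem, no page table locks; falls back to
               * get_user_pages() internally when the ptes don't allow it. */
              int pinned = get_user_pages_fast(uaddr, nr_pages, 1 /* write */, pages);

              if (pinned < nr_pages) {
                      /* partial pin: release what we got and let the caller retry */
                      while (pinned > 0)
                              put_page(pages[--pinned]);
                      return -EFAULT;
              }
              return 0;
      }
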
    • x86: implement pte_special · a0a8f536
      Committed by Nick Piggin
      Implement the pte_special bit for x86.  This is required to support
      lockless get_user_pages, because we need to know whether or not we can
      refcount a particular page given only its pte (and no vma).
      
      [hugh@veritas.com: fix a BUG]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
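
      The accessors added for this look roughly like the sketch below
      (simplified; the bit is one of the software-available pte bits and is
      shown here as _PAGE_SPECIAL):

      static inline int pte_special(pte_t pte)
      {
              return pte_val(pte) & _PAGE_SPECIAL;
      }

      static inline pte_t pte_mkspecial(pte_t pte)
      {
              return __pte(pte_val(pte) | _PAGE_SPECIAL);
      }
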
    • kexec jump · 3ab83521
      Committed by Huang Ying
      This patch provides an enhancement to kexec/kdump.  It implements the
      following features:
      
      - Backup/restore memory used by the original kernel before/after
        kexec.
      
      - Save/restore CPU state before/after kexec.
      
      The features of this patch can be used as a general method to call a
      program in physical mode (with paging turned off).  This can be used to
      call BIOS code under Linux.
      
      kexec-tools needs to be patched to support kexec jump. The patches and
      the precompiled kexec can be download from the following URL:
      
             source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2
             patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2
             binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10
      
      Usage example of calling some physical mode code and return:
      
      1. Compile and install patched kernel with following options selected:
      
      CONFIG_X86_32=y
      CONFIG_KEXEC=y
      CONFIG_PM=y
      CONFIG_KEXEC_JUMP=y
      
      2. Build patched kexec-tool or download the pre-built one.
      
      3. Build a physical mode executable, named for example "phy_mode".
      
      4. Boot the kernel compiled in step 1.
      
      5. Load the physical mode executable with /sbin/kexec. The shell command
         line can be as follows:
      
         /sbin/kexec --load-preserve-context --args-none phy_mode
      
      6. Call the physical mode executable with the following shell command line:
      
         /sbin/kexec -e
      
      Implementation point:
      
      To support jumping without reserving memory, one shadow backup page
      (source page) is allocated for each page used by the kexeced code image
      (destination page).  During kexec_load, the image of the kexeced code is
      loaded into the source pages, and before executing, the destination pages
      and the source pages are swapped, so the contents of the destination pages
      are backed up.  Before jumping to the kexeced code image and after jumping
      back to the original kernel, the destination pages and the source pages
      are swapped again.
      
      C ABI (calling convention) is used as communication protocol between
      kernel and called code.
      
      A flag named KEXEC_PRESERVE_CONTEXT for sys_kexec_load is added to
      indicate that the loaded kernel image is used for jumping back.
      
      Now, only the i386 architecture is supported.
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86 calgary: fix handling of devices that aren't behind the Calgary · 1956a96d
      Committed by Alexis Bruemmer
      The calgary code can give drivers addresses above 4GB, which is very bad
      for hardware that is only 32-bit DMA addressable.
      
      With this patch, the calgary code sets the global dma_ops to swiotlb or
      nommu properly, and the dma_ops of devices behind the Calgary/CalIOC2
      to calgary_dma_ops.  So the calgary code can safely handle devices that
      aren't behind the Calgary/CalIOC2.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Alexis Bruemmer <alexisb@us.ibm.com>
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Muli Ben-Yehuda <muli@il.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dma-mapping: add the device argument to dma_mapping_error() · 8d8bb39b
      Committed by FUJITA Tomonori
      Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
      architecture does:
      
      This enables us to cleanly fix the Calgary IOMMU issue that some devices
      are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).
      
      I think that per-device dma_mapping_ops support would be also helpful for
      KVM people to support PCI passthrough but Andi thinks that this makes it
      difficult to support the PCI passthrough (see the above thread).  So I
      CC'ed this to KVM camp.  Comments are appreciated.
      
      A pointer to dma_mapping_ops is added to struct dev_archdata.  If the
      pointer is non-NULL, the DMA operations in asm/dma-mapping.h use it.  If
      it's NULL, the system-wide dma_ops pointer is used as before.
      
      If it's useful for KVM people, I plan to implement a mechanism to register
      a hook called when a new pci (or dma capable) device is created (it works
      with hot plugging).  It enables IOMMUs to set up an appropriate
      dma_mapping_ops per device.
      
      The major obstacle is that dma_mapping_error doesn't take a pointer to the
      device unlike other DMA operations.  So x86 can't have dma_mapping_ops per
      device.  Note all the POWER IOMMUs use the same dma_mapping_error function
      so this is not a problem for POWER but x86 IOMMUs use different
      dma_mapping_error functions.
      
      The first patch adds the device argument to dma_mapping_error.  The patch
      is trivial but large since it touches lots of drivers and dma-mapping.h in
      all the architectures.
      
      This patch:
      
      dma_mapping_error() doesn't take a pointer to the device unlike other DMA
      operations.  So we can't have dma_mapping_ops per device.
      
      Note that POWER already has dma_mapping_ops per device, but all the POWER
      IOMMUs use the same dma_mapping_error function.  x86 IOMMUs use different
      dma_mapping_error functions, so they need the device argument.
      
      [akpm@linux-foundation.org: fix sge]
      [akpm@linux-foundation.org: fix svc_rdma]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: fix bnx2x]
      [akpm@linux-foundation.org: fix s2io]
      [akpm@linux-foundation.org: fix pasemi_mac]
      [akpm@linux-foundation.org: fix sdhci]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: fix sparc]
      [akpm@linux-foundation.org: fix ibmvscsi]
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Muli Ben-Yehuda <muli@il.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Avi Kivity <avi@qumranet.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
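
      In sketch form, the interface change and a typical driver call site after
      it (the helper function below is illustrative):

      #include <linux/dma-mapping.h>

      /* before: int dma_mapping_error(dma_addr_t dma_addr);
       * after:  int dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
       * Passing the device lets per-device dma_mapping_ops implement the check. */

      static int map_tx_buffer(struct device *dev, void *buf, size_t len,
                               dma_addr_t *handle)
      {
              *handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
              if (dma_mapping_error(dev, *handle))    /* note the new dev argument */
                      return -ENOMEM;
              return 0;
      }
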