1. 03 Sep, 2013 1 commit
    • L
      lockref: implement lockless reference count updates using cmpxchg() · bc08b449
      Committed by Linus Torvalds
      Instead of taking the spinlock, the lockless versions atomically check
      that the lock is not taken, and do the reference count update using a
      cmpxchg() loop.  This is semantically identical to doing the reference
      count update protected by the lock, but avoids the "wait for lock"
      contention that you get when accesses to the reference count are
      contended.
      
      Note that a "lockref" is absolutely _not_ equivalent to an atomic_t.
      Even when the lockref reference counts are updated atomically with
      cmpxchg, the fact that they also verify the state of the spinlock means
      that the lockless updates can never happen while somebody else holds the
      spinlock.
      
      So while "lockref_put_or_lock()" looks a lot like just another name for
      "atomic_dec_and_lock()", and both optimize to lockless updates, they are
      fundamentally different: the decrement done by atomic_dec_and_lock() is
      truly independent of any lock (as long as it doesn't decrement to zero),
      so a locked region can still see the count change.
      
      The lockref structure, in contrast, really is a *locked* reference
      count.  If you hold the spinlock, the reference count will be stable and
      you can modify the reference count without using atomics, because even
      the lockless updates will see and respect the state of the lock.
      
      In order to enable the cmpxchg lockless code, the architecture needs to
      do three things:
      
       (1) Make sure that the "arch_spinlock_t" and an "unsigned int" can fit
           in an aligned u64, and have a "cmpxchg()" implementation that works
           on such a u64 data type.
      
       (2) define a helper function to test for a spinlock being unlocked
           ("arch_spin_value_unlocked()")
      
       (3) select the "ARCH_USE_CMPXCHG_LOCKREF" config variable in its
           Kconfig file.
      
      This enables it for x86-64 (but not 32-bit, we'd need to make sure
      cmpxchg() turns into the proper cmpxchg8b in order to enable it for
      32-bit mode).
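
The cmpxchg() loop described above can be modelled in userspace. The following is a minimal sketch using C11 atomics, not the kernel's implementation: the type `lockref_model`, the helper names, and the bit layout (lock word in the low 32 bits, count in the high 32 bits) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative userspace model, not the kernel's struct lockref.
 * Low 32 bits: spinlock word (0 == unlocked, an assumption of this model).
 * High 32 bits: reference count.
 */
struct lockref_model {
	_Atomic uint64_t lock_count;
};

/* Requirement (1): lock and count must fit together in one u64. */
static_assert(sizeof(struct lockref_model) == 8, "lock + count must fit in a u64");

static inline uint32_t model_lock(uint64_t v)  { return (uint32_t)v; }
static inline uint32_t model_count(uint64_t v) { return (uint32_t)(v >> 32); }

static inline uint64_t model_pack(uint32_t lock, uint32_t count)
{
	return ((uint64_t)count << 32) | lock;
}

/*
 * Try to take a reference without acquiring the lock.
 * Returns true if the count was incremented, false if the lock was
 * observed held (a real caller would then take the locked slow path).
 */
static bool model_get_lockless(struct lockref_model *ref)
{
	uint64_t old = atomic_load(&ref->lock_count);

	while (model_lock(old) == 0) {
		uint64_t new = model_pack(0, model_count(old) + 1);

		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&ref->lock_count, &old, new))
			return true;
	}
	return false;
}
```

Because the compare-and-exchange covers the lock word as well as the count, the update can only succeed while the lock is observed free, which is the property that distinguishes this scheme from a plain atomic counter.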
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 30 Aug, 2013 3 commits
  3. 20 Aug, 2013 1 commit
    • Y
      x86/ioapic/kcrash: Prevent crash_kexec() from deadlocking on ioapic_lock · 17405453
      Committed by Yoshihiro YUNOMAE
      Prevent crash_kexec() from deadlocking on ioapic_lock. When
      crash_kexec() is executed on a CPU, the CPU will take ioapic_lock
      in disable_IO_APIC(). So if the cpu gets an NMI while locking
      ioapic_lock, a deadlock will happen.
      
      In this patch, ioapic_lock is zapped/initialized before disable_IO_APIC().
      
      You can reproduce this deadlock the following way:
      
      1. Add mdelay(1000) after raw_spin_lock_irqsave() in
         native_ioapic_set_affinity()@arch/x86/kernel/apic/io_apic.c
      
         Although the deadlock can occur without this modification, the added
         delay makes it much more likely to trigger.
      
      2. Build and install the kernel
      
      3. Set up the OS which will run panic() and kexec when NMI is injected
          # echo "kernel.unknown_nmi_panic=1" >> /etc/sysctl.conf
          # vim /etc/default/grub
            add "nmi_watchdog=0 crashkernel=256M" in GRUB_CMDLINE_LINUX line
          # grub2-mkconfig
      
      4. Reboot the OS
      
      5. Run the following command for each vcpu on the guest
          # while true; do echo <CPU num> > /proc/irq/<IO-APIC-edge or IO-APIC-fasteoi>/smp_affinity; done;
         By running this command, cpus will take ioapic_lock for setting affinity.
      
      6. Inject NMI (push a dump button or execute 'virsh inject-nmi <domain>' if you
         use VM). After injecting NMI, panic() is called in an nmi-handler context.
         Then, kexec will normally run in panic(), but the operation will be stopped
         by deadlock on ioapic_lock in crash_kexec()->machine_crash_shutdown()->
         native_machine_crash_shutdown()->disable_IO_APIC()->clear_IO_APIC()->
         clear_IO_APIC_pin()->ioapic_read_entry().
      Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: yrl.pp-manager.tt@hitachi.com
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Link: http://lkml.kernel.org/r/20130820070107.28245.83806.stgit@yunodevel
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 14 Aug, 2013 3 commits
  5. 13 Aug, 2013 2 commits
    • O
      sched: fix the theoretical signal_wake_up() vs schedule() race · e0acd0a6
      Committed by Oleg Nesterov
      This is only theoretical, but after try_to_wake_up(p) was changed
      to check p->state under p->pi_lock the code like
      
      	__set_current_state(TASK_INTERRUPTIBLE);
      	schedule();
      
      can miss a signal. This is the special case of wait-for-condition,
      it relies on try_to_wake_up/schedule interaction and thus it does
      not need mb() between __set_current_state() and if(signal_pending).
      
      However, this __set_current_state() can move into the critical
      section protected by rq->lock, now that try_to_wake_up() takes
      another lock we need to ensure that it can't be reordered with
      "if (signal_pending(current))" check inside that section.
      
      The patch is actually a one-liner, it simply adds smp_wmb() before
      spin_lock_irq(rq->lock). This is what try_to_wake_up() already
      does for the same reason.
      
      We turn this wmb() into the new helper, smp_mb__before_spinlock(),
      for better documentation and to allow the architectures to change
      the default implementation.
      
      While at it, kill smp_mb__after_lock(), it has no callers.
      
      Perhaps we can also add smp_mb__before/after_spinunlock() for
      prepare_to_wait().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      x86, microcode, AMD: Fix early microcode loading · 84516098
      Committed by Torsten Kaiser
      load_microcode_amd() (and the helper it is using) should not have a
      cpu parameter. The microcode loading does not depend on the CPU wrt the
      patches loaded since they will end up in a global list for all CPUs
      anyway.
      
      The change from cpu to x86family in load_microcode_amd()
      now allows dropping the code messing with cpu_data(cpu) from
      collect_cpu_info_amd_early(), which is wrong anyway because at that
      point the per-cpu cpu_info is not yet set up (these values would later be
      overwritten by smp_store_boot_cpu_info() / smp_store_cpu_info()).
      
      Fold the rest of collect_cpu_info_amd_early() into load_ucode_amd_ap(),
      because it's only used in one place and, without the cpuinfo_x86
      accesses, not much of it was left.
      Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
      [ Fengguang: build fix ]
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      [ Boris: adapt it to current tree. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
  6. 10 Aug, 2013 1 commit
    • D
      x86: Don't clear olpc_ofw_header when sentinel is detected · d55e37bb
      Committed by Daniel Drake
      OpenFirmware wasn't quite following the protocol described in boot.txt
      and the kernel has detected this through use of the sentinel value
      in boot_params. OFW does zero out almost all of the stuff that it should
      do, but not the sentinel.
      
      This causes the kernel to clear olpc_ofw_header, which breaks x86 OLPC
      support.
      
      OpenFirmware has now been fixed. However, it would be nice if we could
      maintain Linux compatibility with old firmware versions. To do that, we just
      have to avoid zeroing out olpc_ofw_header.
      
      OFW does not write to any other parts of the header that are being zapped
      by the sentinel-detection code, and all users of olpc_ofw_header are
      somewhat protected through checking for the OLPC_OFW_SIG magic value
      before using it. So this should not cause any problems for anyone.
      Signed-off-by: Daniel Drake <dsd@laptop.org>
      Link: http://lkml.kernel.org/r/20130809221420.618E6FAB03@dev.laptop.org
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@vger.kernel.org> # v3.9+
  7. 08 Aug, 2013 1 commit
  8. 07 Aug, 2013 11 commits
  9. 05 Aug, 2013 4 commits
  10. 03 Aug, 2013 2 commits
    • D
      x86: sysfb: move EFI quirks from efifb to sysfb · 2995e506
      Committed by David Herrmann
      The EFI FB quirks from efifb.c are useful for simple-framebuffer devices
      as well. Apply them by default so we can convert efifb.c to use
      efi-framebuffer platform devices.
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Link: http://lkml.kernel.org/r/1375445127-15480-5-git-send-email-dh.herrmann@gmail.com
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • D
      x86: provide platform-devices for boot-framebuffers · e3263ab3
      Committed by David Herrmann
      The current situation regarding boot-framebuffers (VGA, VESA/VBE, EFI) on
      x86 causes troubles when loading multiple fbdev drivers. The global
      "struct screen_info" does not provide any state-tracking about which
      drivers use the FBs. request_mem_region() theoretically works, but
      unfortunately vesafb/efifb ignore it due to quirks for broken boards.
      
      Avoid this by creating platform framebuffer devices with a pointer
      to the "struct screen_info" as platform-data. Drivers can now create
      platform-drivers and the driver-core will refuse multiple drivers being
      active simultaneously.
      
      We keep the screen_info available for backwards-compatibility. Drivers
      can be converted in follow-up patches.
      
      Different devices are created for VGA/VESA/EFI FBs to allow multiple
      drivers to be loaded on distro kernels. We create:
       - "vesa-framebuffer" for VBE/VESA graphics FBs
       - "efi-framebuffer" for EFI FBs
       - "platform-framebuffer" for everything else
      This allows loading vesafb, efifb and others simultaneously, and each
      picks up only the supported FB types.
      
      Apart from platform-framebuffer devices, this also introduces a
      compatibility option for "simple-framebuffer" drivers which recently got
      introduced for OF based systems. If CONFIG_X86_SYSFB is selected, we
      try to match the screen_info against a simple-framebuffer supported
      format. If we succeed, we create a "simple-framebuffer" device instead
      of a platform-framebuffer.
      This allows reusing the simplefb.c driver across architectures and also
      introducing a SimpleDRM driver. There is no need to have vesafb.c,
      efifb.c, simplefb.c and more just to have architecture specific quirks
      in their setup-routines.
      
      Instead, we now move the architecture specific quirks into x86-setup and
      provide a generic simple-framebuffer. For backwards-compatibility (if
      strange formats are used), we still allow vesafb/efifb to be loaded
      simultaneously and pick up all remaining devices.
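
The device selection described above can be summarised in a small userspace sketch. Everything here is hypothetical and only mirrors the rules stated in the message: the struct `fb_info_model`, the enum names, the format table, and `pick_device()` are illustrative stand-ins, not the kernel's data structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down description of a boot framebuffer. */
enum fb_type { FB_VGA, FB_VESA, FB_EFI };

struct fb_info_model {
	enum fb_type type;
	unsigned bits_per_pixel;
	unsigned red_pos, red_size;
	unsigned green_pos, green_size;
	unsigned blue_pos, blue_size;
};

enum dev_kind { DEV_SIMPLE, DEV_VESA, DEV_EFI, DEV_PLATFORM };

static const char *const dev_names[] = {
	[DEV_SIMPLE]   = "simple-framebuffer",
	[DEV_VESA]     = "vesa-framebuffer",
	[DEV_EFI]      = "efi-framebuffer",
	[DEV_PLATFORM] = "platform-framebuffer",
};

/* Pixel layouts the generic driver is assumed to support in this sketch. */
static const struct fb_info_model simple_formats[] = {
	{ .bits_per_pixel = 16, .red_pos = 11, .red_size = 5,
	  .green_pos = 5, .green_size = 6, .blue_pos = 0, .blue_size = 5 },
	{ .bits_per_pixel = 32, .red_pos = 16, .red_size = 8,
	  .green_pos = 8, .green_size = 8, .blue_pos = 0, .blue_size = 8 },
};

static bool is_simple_format(const struct fb_info_model *si)
{
	for (size_t i = 0; i < sizeof(simple_formats) / sizeof(simple_formats[0]); i++) {
		const struct fb_info_model *f = &simple_formats[i];

		if (si->bits_per_pixel == f->bits_per_pixel &&
		    si->red_pos == f->red_pos && si->red_size == f->red_size &&
		    si->green_pos == f->green_pos && si->green_size == f->green_size &&
		    si->blue_pos == f->blue_pos && si->blue_size == f->blue_size)
			return true;
	}
	return false;
}

/* Choose which platform device to create for the boot framebuffer. */
static enum dev_kind pick_device(const struct fb_info_model *si, bool sysfb_enabled)
{
	assert(si != NULL);

	if (sysfb_enabled && is_simple_format(si))
		return DEV_SIMPLE;

	switch (si->type) {
	case FB_VESA: return DEV_VESA;
	case FB_EFI:  return DEV_EFI;
	default:      return DEV_PLATFORM;
	}
}
```

With the compatibility option off, the same EFI framebuffer falls through to the type-specific device, so the legacy drivers keep working unchanged.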
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Link: http://lkml.kernel.org/r/1375445127-15480-4-git-send-email-dh.herrmann@gmail.com
      Tested-by: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  11. 01 Aug, 2013 1 commit
  12. 30 Jul, 2013 1 commit
  13. 26 Jul, 2013 1 commit
  14. 23 Jul, 2013 2 commits
    • A
      perf/x86: Add ability to calculate TSC from perf sample timestamps · c73deb6a
      Committed by Adrian Hunter
      For modern CPUs, perf clock is directly related to TSC.  TSC
      can be calculated from perf clock and vice versa using a simple
      calculation.  Two of the three components of that calculation
      are already exported in struct perf_event_mmap_page.  This patch
      exports the third.
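
The "simple calculation" can be sketched as follows. This is a userspace illustration, assuming the three components are a multiplier, a shift, and a zero offset, named here `time_mult`, `time_shift`, and `time_zero` after the fields of struct perf_event_mmap_page as I understand them; the container struct `tsc_conv` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three conversion parameters. */
struct tsc_conv {
	uint16_t time_shift;
	uint32_t time_mult;
	uint64_t time_zero;
};

/* Convert a perf sample timestamp to TSC cycles. */
static uint64_t perf_time_to_tsc(uint64_t timestamp, const struct tsc_conv *tc)
{
	assert(tc->time_mult != 0);

	uint64_t t = timestamp - tc->time_zero;
	uint64_t quot = t / tc->time_mult;
	uint64_t rem = t % tc->time_mult;

	/* Split into quotient and remainder to limit intermediate overflow. */
	return (quot << tc->time_shift) + (rem << tc->time_shift) / tc->time_mult;
}

/* Convert TSC cycles to a perf sample timestamp. */
static uint64_t tsc_to_perf_time(uint64_t cyc, const struct tsc_conv *tc)
{
	uint64_t quot = cyc >> tc->time_shift;
	uint64_t rem = cyc & (((uint64_t)1 << tc->time_shift) - 1);

	return tc->time_zero + quot * tc->time_mult +
	       ((rem * tc->time_mult) >> tc->time_shift);
}
```

The two functions are inverses only up to rounding, since both directions use integer division or shifts.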
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/1372425741-1676-3-git-send-email-adrian.hunter@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • J
      kprobes/x86: Call out into INT3 handler directly instead of using notifier · 17f41571
      Committed by Jiri Kosina
      In fd4363ff ("x86: Introduce int3 (breakpoint)-based
      instruction patching"), the mechanism that was introduced for
      notifying alternatives code from the int3 exception handler that an
      exception occurred was die_notifier.
      
      This is however problematic, as early code might be using jump
      labels even before the notifier registration has been performed,
      which will then lead to an oops due to an unhandled exception. One
      such occurrence has been encountered by Fengguang:
      
       int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in:
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.11.0-rc1-01429-g04bf576 #8
       task: ffff88000da1b040 ti: ffff88000da1c000 task.ti: ffff88000da1c000
       RIP: 0010:[<ffffffff811098cc>]  [<ffffffff811098cc>] ttwu_do_wakeup+0x28/0x225
       RSP: 0000:ffff88000dd03f10  EFLAGS: 00000006
       RAX: 0000000000000000 RBX: ffff88000dd12940 RCX: ffffffff81769c40
       RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000001
       RBP: ffff88000dd03f28 R08: ffffffff8176a8c0 R09: 0000000000000002
       R10: ffffffff810ff484 R11: ffff88000dd129e8 R12: ffff88000dbc90c0
       R13: ffff88000dbc90c0 R14: ffff88000da1dfd8 R15: ffff88000da1dfd8
       FS:  0000000000000000(0000) GS:ffff88000dd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 00000000ffffffff CR3: 0000000001c88000 CR4: 00000000000006e0
       Stack:
        ffff88000dd12940 ffff88000dbc90c0 ffff88000da1dfd8 ffff88000dd03f48
        ffffffff81109e2b ffff88000dd12940 0000000000000000 ffff88000dd03f68
        ffffffff81109e9e 0000000000000000 0000000000012940 ffff88000dd03f98
       Call Trace:
        <IRQ>
        [<ffffffff81109e2b>] ttwu_do_activate.constprop.56+0x6d/0x79
        [<ffffffff81109e9e>] sched_ttwu_pending+0x67/0x84
        [<ffffffff8110c845>] scheduler_ipi+0x15a/0x2b0
        [<ffffffff8104dfb4>] smp_reschedule_interrupt+0x38/0x41
        [<ffffffff8173bf5d>] reschedule_interrupt+0x6d/0x80
        <EOI>
        [<ffffffff810ff484>] ? __atomic_notifier_call_chain+0x5/0xc1
        [<ffffffff8105cc30>] ? native_safe_halt+0xd/0x16
        [<ffffffff81015f10>] default_idle+0x147/0x282
        [<ffffffff81017026>] arch_cpu_idle+0x3d/0x5d
        [<ffffffff81127d6a>] cpu_idle_loop+0x46d/0x5db
        [<ffffffff81127f5c>] cpu_startup_entry+0x84/0x84
        [<ffffffff8104f4f8>] start_secondary+0x3c8/0x3d5
        [...]
      
      Fix this by directly calling poke_int3_handler() from the int3
      exception handler (analogously to what ftrace has been doing
      already), instead of relying on the notifier, registration of which
      might not have yet been finalized by the time of the first trap.
      Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307231007490.14024@pobox.suse.cz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 19 Jul, 2013 1 commit
  16. 17 Jul, 2013 2 commits
    • J
      x86: Introduce int3 (breakpoint)-based instruction patching · fd4363ff
      Committed by Jiri Kosina
      Introduce a method for run-time instruction patching on a live SMP kernel
      based on int3 breakpoint, completely avoiding the need for stop_machine().
      
      The way this is achieved:
      
      	- add an int3 trap to the address that will be patched
      	- sync cores
      	- update all but the first byte of the patched range
      	- sync cores
      	- replace the first byte (int3) by the first byte of
      	  the replacement opcode
      	- sync cores
      
      According to
      
      	http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/01530.html
      
      synchronization after replacing "all but first" instructions should not
      be necessary (on Intel hardware), as the syncing after the subsequent
      patching of the first byte provides enough safety.
      But there's not only Intel HW out there, and we'd rather be on the safe
      side.
      
      If any CPU instruction execution would collide with the patching,
      it'd be trapped by the int3 breakpoint and redirected to the provided
      "handler" (which would typically mean just skipping over the patched
      region, acting as if a "nop" had been there, in case we are doing
      nop -> jump and jump -> nop transitions).
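
The patching sequence listed earlier in this message can be illustrated on an ordinary byte buffer. This is a userspace sketch only: `patch_with_int3()`, the `sync_fn` callback, and the counting stub `count_sync()` are hypothetical names, and the "sync cores" step is reduced to a plain function call rather than real cross-CPU serialization.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INT3_OPCODE 0xCC	/* x86 single-byte breakpoint instruction */

/* Hypothetical stand-in for the cross-CPU synchronization step. */
typedef void (*sync_fn)(void);

/* Demo stub: counts how many times a sync was requested. */
static int sync_count;
static void count_sync(void) { sync_count++; }

/*
 * Replace 'len' bytes at 'addr' with 'opcode' in three phases so that a
 * concurrent reader sees the old bytes, the breakpoint, or the new bytes,
 * but never a mix starting with a valid first byte.
 */
static void patch_with_int3(unsigned char *addr, const unsigned char *opcode,
			    size_t len, sync_fn sync_cores)
{
	assert(len >= 1);

	/* 1. Put the breakpoint on the first byte. */
	addr[0] = INT3_OPCODE;
	sync_cores();

	/* 2. Update all but the first byte. */
	memcpy(addr + 1, opcode + 1, len - 1);
	sync_cores();

	/* 3. Replace the breakpoint with the first byte of the new opcode. */
	addr[0] = opcode[0];
	sync_cores();
}
```

For example, calling `patch_with_int3(text, jmp, 5, count_sync)` on a 5-byte nop turns it into the 5-byte jump in `jmp` and requests three syncs along the way.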
      
      Ftrace has been using this very technique since 08d636b6 ("ftrace/x86:
      Have arch x86_64 use breakpoints instead of stop machine") for ages
      already, and jump labels are another obvious potential user of this.
      
      Based on activities of Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      a few years ago.
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307121102440.29788@pobox.suse.cz
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • H
      x86, bitops: Change bitops to be native operand size · 9b710506
      Committed by H. Peter Anvin
      Change the bitops operation to be naturally "long", i.e. 63 bits on
      the 64-bit kernel.  Additional bugs are likely to crop up in the
      future.
      
      We already have bugs on machines with > 16 TiB of memory in a
      single node, as can happen if memory is interleaved.  The x86 bitop
      operations take a signed index, so using an unsigned type is not an
      option.
      
      Jim Kukunas measured the effect of this patch on kernel size: it adds
      2779 bytes to the allyesconfig kernel.  Some of that probably could be
      elided by replacing the inline functions with macros which select the
      32-bit type if the index is a 32-bit value, something like:
      
      In that case we could also use "Jr" constraints for the 64-bit
      version.
      
      However, this would more than double the amount of code for a
      relatively small gain.
      
      Note that we can't use ilog2() for _BITOPS_LONG_SHIFT, as that causes
      a recursive header inclusion problem.
      
      The change to constant_test_bit() should both generate better code and
      give correct result for negative bit indices.  As previously written
      the compiler had to generate extra code to create the proper wrong
      result for negative values.
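
A bit test with a native-width signed index can be sketched in userspace as follows. The names `test_bit_long()` and `BITOPS_LONG_SHIFT` are illustrative (the message refers to `constant_test_bit()` and `_BITOPS_LONG_SHIFT`), and the sketch assumes the compiler implements right shift of a negative `long` as an arithmetic shift, which is implementation-defined in C but is what GCC and Clang do.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((long)(sizeof(long) * CHAR_BIT))

/* log2(BITS_PER_LONG), written out by hand instead of calling a helper. */
#define BITOPS_LONG_SHIFT (sizeof(long) == 8 ? 6 : 5)

static_assert(sizeof(long) == 4 || sizeof(long) == 8, "unexpected long size");

/*
 * Test bit 'nr' relative to 'addr'. The index is a signed long, so
 * negative values address words before 'addr' (assuming arithmetic
 * right shift) and large positive values are not truncated to 32 bits.
 */
static int test_bit_long(long nr, const unsigned long *addr)
{
	return ((1UL << (nr & (BITS_PER_LONG - 1))) &
		addr[nr >> BITOPS_LONG_SHIFT]) != 0;
}
```

With `base` pointing at the middle word of a three-word array, `test_bit_long(-1, base)` inspects the top bit of the word just before `base`.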
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Jim Kukunas <james.t.kukunas@intel.com>
      Link: http://lkml.kernel.org/n/tip-z61ofiwe90xeyb461o72h8ya@git.kernel.org
  17. 15 Jul, 2013 1 commit
    • P
      x86: delete __cpuinit usage from all x86 files · 148f9bb8
      Committed by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      Note that some harmless section mismatch warnings may result, since
      notify_cpu_starting() and cpu_up(), which are arch independent
      (kernel/cpu.c), are flagged as __cpuinit -- so if we remove the
      __cpuinit from arch specific callers, we will also get section
      mismatch warnings.
      As an intermediate step, we intend to turn the linux/init.h cpuinit
      content into no-ops as early as possible, since that will get rid
      of these warnings.  In any case, they are temporary and harmless.
      
      This removes all the arch/x86 uses of the __cpuinit macros from
      all C files.  x86 only had the one __CPUINIT used in assembly files,
      and it wasn't paired off with a .previous or a __FINIT, so we can
      delete it directly w/o any corresponding additional change there.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  18. 10 Jul, 2013 1 commit
  19. 04 Jul, 2013 1 commit