1. 08 May, 2015 1 commit
  2. 07 May, 2015 14 commits
    • KVM: x86: dump VMCS on invalid entry · 4eb64dce
      Paolo Bonzini committed
      Code and format roughly based on Xen's vmcs_dump_vcpu.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86: kvmclock: drop rdtsc_barrier() · a3eb97bd
      Marcelo Tosatti committed
      Drop unnecessary rdtsc_barrier(), as has been determined empirically,
      see 057e6a8c for details.
      
      Noticed by Andy Lutomirski.
      
      Improves clock_gettime() by approximately 15% on
      Intel i7-3520M @ 2.90GHz.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: drop unneeded null test · d90e3a35
      Julia Lawall committed
      If the null test is needed, the call to cancel_delayed_work_sync would have
      already crashed.  Normally, the destroy function should only be called
      if the init function has succeeded, in which case ioapic is not null.
      
      Problem found using Coccinelle.
      Suggested-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: fix initial PAT value · 74545705
      Radim Krčmář committed
      PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
      VMX used a wrong value (host's PAT) and while SVM used the right one,
      it never got to arch.pat.
      
      This is not an issue with QEMU as it will force the correct value.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
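For reference, the reset value quoted in this commit can be decoded entry by entry. The following is a minimal sketch in plain C, not KVM code: the helper names `pat_entry`, `pat_type_name`, and `pat_dump` are hypothetical, and the type encodings follow the usual x86 memory-type numbering (0 = UC, 1 = WC, 4 = WT, 5 = WP, 6 = WB, 7 = UC-).

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural PAT value after RESET, as quoted in the commit message. */
#define PAT_RESET_VALUE 0x0007040600070406ULL

/* Return the 3-bit memory type stored in PAT entry `index` (0..7).
 * Each entry occupies one byte; only the low three bits are used. */
unsigned pat_entry(uint64_t pat, unsigned index)
{
    return (unsigned)((pat >> (index * 8)) & 0x7);
}

/* Map a memory-type encoding to its conventional short name. */
const char *pat_type_name(unsigned type)
{
    switch (type) {
    case 0: return "UC";
    case 1: return "WC";
    case 4: return "WT";
    case 5: return "WP";
    case 6: return "WB";
    case 7: return "UC-";
    default: return "reserved";
    }
}

/* Print all eight entries of a PAT value, one per line. */
void pat_dump(uint64_t pat)
{
    for (unsigned i = 0; i < 8; i++)
        printf("PA%u = %s\n", i, pat_type_name(pat_entry(pat, i)));
}
```

Calling `pat_dump(PAT_RESET_VALUE)` lists the sequence WB, WT, UC-, UC twice, since the upper and lower halves of the reset value are identical.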
    • kvm,x86: load guest FPU context more eagerly · 653f52c3
      Rik van Riel committed
      Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to
      re-load them every time the guest accesses the FPU after a switch back into
      the guest from the host.
      
      This patch copies the x86 task switch semantics for FPU loading, with the
      FPU loaded eagerly after first use if the system uses eager fpu mode,
      or if the guest uses the FPU frequently.
      
      In the latter case, after loading the FPU for 255 times, the fpu_counter
      will roll over, and we will revert to loading the FPU on demand, until
      it has been established that the guest is still actively using the FPU.
      
      This mirrors the x86 task switch policy, which seems to work.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
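The policy described in this commit can be modelled outside the kernel roughly as follows. This is a hedged sketch only: `struct vcpu_model`, `fpu_should_load_eagerly`, and the threshold `FPU_EAGER_THRESHOLD` are assumptions made for illustration, not the kernel's names or values. The one detail taken from the commit message is the 8-bit counter that rolls over after 255 loads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: number of FPU loads after which the guest is
 * treated as a frequent FPU user. The real value is not given here. */
#define FPU_EAGER_THRESHOLD 5

/* Hypothetical stand-in for the per-vCPU state involved. */
struct vcpu_model {
    bool eager_fpu;       /* system-wide eager FPU mode */
    uint8_t fpu_counter;  /* 8-bit, so it wraps from 255 back to 0 */
};

/* Record one FPU load and decide whether the next guest entry should
 * load the FPU eagerly (true) or fall back to on-demand loading (false). */
bool fpu_should_load_eagerly(struct vcpu_model *v)
{
    if (v->eager_fpu)
        return true;

    v->fpu_counter++;  /* wraps to 0 after 255 loads */
    return v->fpu_counter >= FPU_EAGER_THRESHOLD;
}
```

When the counter wraps to zero the function returns false again, which corresponds to the commit's "revert to loading the FPU on demand" until the guest proves it is still using the FPU.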
    • kvm: x86: Deliver MSI IRQ to only lowest prio cpu if msi_redir_hint is true · d1ebdbf9
      James Sullivan committed
      An MSI interrupt should only be delivered to the lowest priority CPU
      when it has RH=1, regardless of the delivery mode. Modified
      kvm_is_dm_lowest_prio() to check for either irq->delivery_mode == APIC_DM_LOWPRI
      or irq->msi_redir_hint.
      
      Moved kvm_is_dm_lowest_prio() into lapic.h and renamed to
      kvm_lowest_prio_delivery().
      
      Changed a check in kvm_irq_delivery_to_apic_fast() from
      irq->delivery_mode == APIC_DM_LOWPRI to kvm_is_dm_lowest_prio().
      Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
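The check this commit describes reduces to a single boolean expression. The sketch below models it in standalone C under stated assumptions: `struct lapic_irq_model` and `lowest_prio_delivery` are hypothetical stand-ins for `struct kvm_lapic_irq` and `kvm_lowest_prio_delivery()`, and the numeric values of the delivery-mode constants are placeholders chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder delivery-mode encodings (assumed values for this sketch). */
#define DM_FIXED  0x000
#define DM_LOWPRI 0x100

/* Hypothetical, reduced model of the IRQ descriptor. */
struct lapic_irq_model {
    uint16_t delivery_mode;
    bool msi_redir_hint;  /* true when the MSI had RH=1 */
};

/* Deliver to the lowest-priority CPU only if the delivery mode asks for
 * it, or if the MSI carried the redirection hint. */
bool lowest_prio_delivery(const struct lapic_irq_model *irq)
{
    return irq->delivery_mode == DM_LOWPRI || irq->msi_redir_hint;
}
```

With this shape, a fixed-mode MSI with RH=1 is routed to a single lowest-priority CPU, while a fixed-mode MSI with RH=0 is not.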
    • kvm: x86: Extended struct kvm_lapic_irq with msi_redir_hint for MSI delivery · 93bbf0b8
      James Sullivan committed
      Extended struct kvm_lapic_irq with bool msi_redir_hint, which will
      be used to determine if the delivery of the MSI should target only
      the lowest priority CPU in the logical group specified for delivery.
      (In physical dest mode, the RH bit is not relevant). Initialized the value
      of msi_redir_hint to true when RH=1 in kvm_set_msi_irq(), and initialized
      to false in all other cases.
      
      Added value of msi_redir_hint to a debug message dump of an IRQ in
      apic_send_ipi().
      Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
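As an illustration of where the RH value comes from: in the x86 MSI address format, the redirection hint is bit 3 and the destination mode is bit 2 of the low address word. The sketch below extracts both. It is not the body of `kvm_set_msi_irq()`; `struct msi_irq_model` and `msi_to_irq_model` are hypothetical names used only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSI_ADDR_DEST_MODE_BIT  2  /* DM: 0 = physical, 1 = logical */
#define MSI_ADDR_REDIR_HINT_BIT 3  /* RH: redirection hint */

/* Hypothetical, reduced model of the fields derived from an MSI address. */
struct msi_irq_model {
    bool logical_dest;    /* destination mode is logical */
    bool msi_redir_hint;  /* RH bit was set in the MSI address */
};

/* Decode the DM and RH bits from the low 32 bits of an MSI address. */
struct msi_irq_model msi_to_irq_model(uint32_t address_lo)
{
    struct msi_irq_model irq;

    irq.logical_dest = (address_lo >> MSI_ADDR_DEST_MODE_BIT) & 1u;
    irq.msi_redir_hint = (address_lo >> MSI_ADDR_REDIR_HINT_BIT) & 1u;
    return irq;
}
```

As the commit message notes, the hint only matters in logical destination mode; a consumer of this structure would typically check both fields together.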
    • KVM: x86: tweak types of fields in kvm_lapic_irq · b7cb2231
      Paolo Bonzini committed
      Change to u16 if they only contain data in the low 16 bits.
      
      Change the level field to bool, since we assign 1 sometimes, but
      just mask icr_low with APIC_INT_ASSERT in apic_send_ipi.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: INIT and reset sequences are different · d28bc9dd
      Nadav Amit committed
      The x86 architecture defines differences between the reset and INIT sequences.
      INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
      MSRs (in general), MTRRs, machine-check, APIC ID, APIC arbitration ID and BSP.
      
      References (from Intel SDM):
      
      "If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
      to a specific processor or system wide) do not cause the MP protocol to be
      repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]
      
      [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
      
      "If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
      changed." [9.2: X87 FPU INITIALIZATION]
      
      "The state of the local APIC following an INIT reset is the same as it is after
      a power-up or hardware reset, except that the APIC ID and arbitration ID
      registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
      ("Wait-for-SIPI" State)]
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Support for disabling quirks · 90de4a18
      Nadav Amit committed
      Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previously
      created in order to overcome QEMU issues. Those issues were mostly the result of
      an invalid VM BIOS.  Currently there are two quirks that can be disabled:
      
      1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot
      2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot
      
      These two issues are already resolved in recent releases of QEMU, and would
      therefore be disabled by QEMU.
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il>
      [Report capability from KVM_CHECK_EXTENSION too. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
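One way to picture the mechanism is a per-VM bitmask of disabled quirks that the hypervisor consults before applying legacy behaviour. The sketch below is a standalone model, not the KVM implementation: `struct vm_model`, `vm_disable_quirks`, and `vm_quirk_active` are hypothetical, and the bit positions assigned to the two quirks are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit assignments for the two quirks named in the commit. */
#define QUIRK_LINT0_REENABLED (1u << 0)
#define QUIRK_CD_NW_CLEARED   (1u << 1)

/* Hypothetical per-VM state: which quirks userspace has turned off. */
struct vm_model {
    uint64_t disabled_quirks;
};

/* Userspace side: request that the given quirks be disabled. */
void vm_disable_quirks(struct vm_model *vm, uint64_t mask)
{
    vm->disabled_quirks |= mask;
}

/* Hypervisor side: a quirk stays active unless it was disabled. */
bool vm_quirk_active(const struct vm_model *vm, uint64_t quirk)
{
    return !(vm->disabled_quirks & quirk);
}
```

Quirks default to active so that older userspace keeps its existing behaviour; only a userspace that knows its BIOS is correct opts out.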
    • KVM: booke: use __kvm_guest_exit · e233d54d
      Paolo Bonzini committed
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm/mips/x86/power use __kvm_guest_{enter|exit} · ccf73aaf
      Christian Borntraeger committed
      Use __kvm_guest_{enter|exit} instead of kvm_guest_{enter|exit}
      where interrupts are disabled.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: provide irq_unsafe kvm_guest_{enter|exit} · 0097d12e
      Christian Borntraeger committed
      Several kvm architectures disable interrupts before kvm_guest_enter.
      kvm_guest_enter then uses local_irq_save/restore to disable interrupts
      again or for the first time. Let's provide underscore versions of
      kvm_guest_{enter|exit} that assume being called locked.
      kvm_guest_enter now disables interrupts for the full function and
      thus we can remove the check for preemptible.
      
      This patch then adapts s390/kvm to use local_irq_disable/enable calls
      which are slightly cheaper than local_irq_save/restore and call these
      new functions.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
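The split between a wrapper that manages interrupts itself and a variant that assumes they are already off can be sketched in user-space C. Everything here is a model under stated assumptions: the `irqs_enabled` flag stands in for the CPU interrupt state, `guest_depth` is an invented counter, and `guest_enter_model` and `guest_enter_irqoff_model` are hypothetical counterparts of `kvm_guest_enter` and `__kvm_guest_enter`.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the CPU interrupt-enable state (assumption of the model). */
static bool irqs_enabled = true;

/* Invented counter so the effect of entering the guest is observable. */
static int guest_depth;

/* Counterpart of the underscore version: the caller must already have
 * interrupts disabled, so nothing is saved or restored here. */
void guest_enter_irqoff_model(void)
{
    assert(!irqs_enabled);
    guest_depth++;
}

/* Counterpart of the plain version: save the interrupt state, disable
 * interrupts for the whole call, then restore the saved state. */
void guest_enter_model(void)
{
    bool saved = irqs_enabled;  /* models local_irq_save() */

    irqs_enabled = false;
    guest_enter_irqoff_model();
    irqs_enabled = saved;       /* models local_irq_restore() */
}
```

Architectures that already run with interrupts disabled at this point call the `irqoff` variant directly and skip the redundant save and restore.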
    • kvmclock: set scheduler clock stable · ff7bbb9c
      Luiz Capitulino committed
      If you try to enable NOHZ_FULL on a guest today, you'll get
      the following error when the guest tries to deactivate the
      scheduler tick:
      
       WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290()
       NO_HZ FULL will not work with unstable sched clock
       CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb7 #204
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       Workqueue: events flush_to_ldisc
        ffffffff8162a0c7 ffff88011f583e88 ffffffff814e6ba0 0000000000000002
        ffff88011f583ed8 ffff88011f583ec8 ffffffff8104d095 ffff88011f583eb8
        0000000000000000 0000000000000003 0000000000000001 0000000000000001
       Call Trace:
        <IRQ>  [<ffffffff814e6ba0>] dump_stack+0x4f/0x7b
        [<ffffffff8104d095>] warn_slowpath_common+0x85/0xc0
        [<ffffffff8104d146>] warn_slowpath_fmt+0x46/0x50
        [<ffffffff810bd2a9>] can_stop_full_tick+0xb9/0x290
        [<ffffffff810bd9ed>] tick_nohz_irq_exit+0x8d/0xb0
        [<ffffffff810511c5>] irq_exit+0xc5/0x130
        [<ffffffff814f180a>] smp_apic_timer_interrupt+0x4a/0x60
        [<ffffffff814eff5e>] apic_timer_interrupt+0x6e/0x80
        <EOI>  [<ffffffff814ee5d1>] ? _raw_spin_unlock_irqrestore+0x31/0x60
        [<ffffffff8108bbc8>] __wake_up+0x48/0x60
        [<ffffffff8134836c>] n_tty_receive_buf_common+0x49c/0xba0
        [<ffffffff8134a6bf>] ? tty_ldisc_ref+0x1f/0x70
        [<ffffffff81348a84>] n_tty_receive_buf2+0x14/0x20
        [<ffffffff8134b390>] flush_to_ldisc+0xe0/0x120
        [<ffffffff81064d05>] process_one_work+0x1d5/0x540
        [<ffffffff81064c81>] ? process_one_work+0x151/0x540
        [<ffffffff81065191>] worker_thread+0x121/0x470
        [<ffffffff81065070>] ? process_one_work+0x540/0x540
        [<ffffffff8106b4df>] kthread+0xef/0x110
        [<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
        [<ffffffff814ef4f2>] ret_from_fork+0x42/0x70
        [<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
       ---[ end trace 06e3507544a38866 ]---
      
      However, it turns out that kvmclock does provide a stable
      sched_clock callback. So, let the scheduler know this which
      in turn makes NOHZ_FULL work in the guest.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 04 May, 2015 8 commits
  4. 03 May, 2015 3 commits
    • ext4: fix growing of tiny filesystems · 2c869b26
      Jan Kara committed
      The estimate of necessary transaction credits in ext4_flex_group_add()
      is too pessimistic. It reserves credit for sb, resize inode, and resize
      inode dindirect block for each group added in a flex group although they
      are always the same block and thus it is enough to account them only
      once. Also the number of modified GDT blocks is overestimated since we
      fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.
      
      Make the estimation more precise. That reduces the number of requested
      credits enough that we can grow a 20 MB filesystem (which has a 1 MB
      journal, 79 reserved GDT blocks, and flex group size 16 by default).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
    • ext4: move check under lock scope to close a race. · 280227a7
      Davide Italiano committed
      fallocate() checks that the file is extent-based and returns
      EOPNOTSUPP if it is not. Other tasks can convert the file between
      indirect and extent formats, so the check is only safe after
      grabbing the inode mutex.
      Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: fix data corruption caused by unwritten and delayed extents · d2dc317d
      Lukas Czerner committed
      Currently it is possible to lose a whole file system block's worth of
      data when we hit a specific interaction between unwritten and delayed
      extents in the extent status tree.
      
      The problem is that when we insert a delayed extent into the extent
      status tree, the only way to get rid of it is to write out the delayed
      buffer. However, there is a limitation in the extent status tree
      implementation: when inserting an unwritten extent, should there be even
      a single delayed block, the whole unwritten extent would be marked as
      delayed.
      
      At this point, there is no way to get rid of the delayed extents,
      because there are no delayed buffers to write out. So when we write
      into said unwritten extent we will convert it to written, but it still
      remains delayed.
      
      When we try to write into that block later, ext4_da_map_blocks() will set
      the buffer new and delayed and map it to an invalid block, which causes
      the rest of the block to be zeroed, losing already written data.
      
      For now we can fix this by simply not allowing to set delayed status on
      written extent in the extent status tree. Also add WARN_ON() to make
      sure that we notice if this happens in the future.
      
      This problem can be easily reproduced by running the following xfs_io.
      
      xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
                -c "falloc 0 131072" \
                -c "pwrite -S 0xbb 65536 2048" \
                -c "fsync" /mnt/test/fff
      
      echo 3 > /proc/sys/vm/drop_caches
      xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
      
      In theory this can also be reproduced at random by running fsx, though
      not very reliably; on machines with a bigger page size (like ppc) it
      is seen more often (especially with xfstest generic/127).
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  5. 02 May, 2015 10 commits
  6. 01 May, 2015 4 commits
    • Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 64887b68
      Linus Torvalds committed
      Pull btrfs fixes from Chris Mason:
       "A few more btrfs fixes.
      
        These range from corners Filipe found in the new free space cache
        writeback to a grab bag of fixes from the list"
      
      * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
        Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
        btrfs: unlock i_mutex after attempting to delete subvolume during send
        btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
        btrfs: fix race on ENOMEM in alloc_extent_buffer
        btrfs: handle ENOMEM in btrfs_alloc_tree_block
        Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
        Btrfs: don't check for delalloc_bytes in cache_save_setup
        Btrfs: fix deadlock when starting writeback of bg caches
        Btrfs: fix race between start dirty bg cache writeout and bg deletion
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 036f351e
      Linus Torvalds committed
      Pull arm64 fixes from Will Deacon:
       "Not too much here, but we've addressed a couple of nasty issues in the
        dma-mapping code as well as adding the halfword and byte variants of
        load_acquire/store_release following on from the CSD locking bug that
        you fixed in the core.
      
         - fix perf devicetree warnings at probe time
      
         - fix memory leak in __dma_free()
      
         - ensure DMA buffers are always zeroed
      
         - show IRQ trigger in /proc/interrupts (for parity with ARM)
      
         - implement byte and halfword access for smp_{load_acquire,store_release}"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: perf: Fix the pmu node name in warning message
        arm64: perf: don't warn about missing interrupt-affinity property for PPIs
        arm64: add missing PAGE_ALIGN() to __dma_free()
        arm64: dma-mapping: always clear allocated buffers
        ARM64: Enable CONFIG_GENERIC_IRQ_SHOW_LEVEL
        arm64: add missing data types in smp_load_acquire/smp_store_release
    • powerpc/powernv: Restore non-volatile CRs after nap · 0aab3747
      Sam Bobroff committed
      Patches 7cba160a "powernv/cpuidle: Redesign idle states management"
      and 77b54e9f "powernv/powerpc: Add winkle support for offline cpus"
      use non-volatile condition registers (cr2, cr3 and cr4) early in the system
      reset interrupt handler (system_reset_pSeries()) before it has been determined
      if state loss has occurred. If state loss has not occurred, control returns via
      the power7_wakeup_noloss() path which does not restore those condition
      registers, leaving them corrupted.
      
      Fix this by restoring the condition registers in the power7_wakeup_noloss()
      case.
      
      This is apparent when running a KVM guest on hardware that does not
      support winkle or sleep and the guest makes use of secondary threads. In
      practice this means Power7 machines, though some early unreleased Power8
      machines may also be susceptible.
      
      The secondary CPUs are taken off line before the guest is started and
      they call pnv_smp_cpu_kill_self(). This checks support for sleep
      states (in this case there is no support) and power7_nap() is called.
      
      When the CPU is woken, power7_nap() returns and because the CPU is
      still off line, the main while loop executes again. The sleep states
      support test is executed again, but because the tested values cannot
      have changed, the compiler has optimized the test away and instead we
      rely on the result of the first test, which has been left in cr3
      and/or cr4. With the result overwritten, the wrong branch is taken and
      power7_winkle() is called on a CPU that does not support it, leading
      to it stalling.
      
      Fixes: 7cba160a ("powernv/cpuidle: Redesign idle states management")
      Fixes: 77b54e9f ("powernv/powerpc: Add winkle support for offline cpus")
      [mpe: Massage change log a bit more]
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/eeh: Delay probing EEH device during hotplug · d91dafc0
      Gavin Shan committed
      Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH
      devices at an early stage, which is reasonable for the pSeries
      platform. However, it's wrong for the PowerNV platform because the
      PE# isn't determined until the resources (IO and MMIO) are assigned
      to the PE in the hotplug case. So we have to delay probing EEH
      devices on the PowerNV platform until the PE# is assigned.
      
      Fixes: ff57b454 ("powerpc/eeh: Do probe on pci_dn")
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>