1. 13 May, 2016 1 commit
    • C
      KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Christian Borntraeger committed
      Some wakeups should not be considered a successful poll. For example, on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable, letting all vCPUs poll all the time for
      transactional-like workloads, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups as
      good or bad with regard to polls.
      
      For s390 the implementation will fence off halt polling for anything but
      known good, single vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken up CPU by marking the poll of this CPU as "good" poll.
      This code will also mark several other wakeup reasons like IPI or
      expired timers as "good". This will of course also mark some events as
      not successful. As KVM on z always runs as a 2nd-level hypervisor,
      we prefer not to poll unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like the uperf 1-byte
      transactional ping-pong workload or wakeup-heavy workloads like OLTP
      while still providing a proper speedup.
      
      This also introduces a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3491caf2
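The qualification idea described in this commit can be sketched as a small user-space model. This is only an illustration under stated assumptions: the type names, the wakeup reasons, and the grow/reset policy for the polling window are all hypothetical and are not the kernel's implementation. It shows an architecture hook deciding whether a wakeup counts as a good poll, and a wakeup that does not qualify being counted separately and fencing off further polling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum wake_reason {
    WAKE_IPI,                /* aimed at exactly one vCPU */
    WAKE_TIMER,              /* this vCPU's own timer expired */
    WAKE_FLOATING_IO_TARGET, /* floating interrupt, this vCPU was the one woken */
    WAKE_FLOATING_IO_OTHER   /* floating interrupt merely visible to this vCPU */
};

struct vcpu_model {
    uint64_t poll_ns;           /* current halt-poll window */
    uint64_t stat_poll_success; /* polls ended by a qualified wakeup */
    uint64_t stat_no_tuning;    /* wakeups not considered good for polling */
};

#define POLL_BASE_NS 10000u  /* assumed starting window */
#define POLL_MAX_NS  500000u /* assumed upper bound */

/* Architecture hook: only known-good, single-vCPU events qualify. */
static bool arch_wakeup_is_good(enum wake_reason r)
{
    return r != WAKE_FLOATING_IO_OTHER;
}

/* Account one wakeup that arrived while the vCPU was polling. */
static void account_wakeup(struct vcpu_model *v, enum wake_reason r)
{
    if (!arch_wakeup_is_good(r)) {
        v->stat_no_tuning++;
        v->poll_ns = 0; /* assumed policy: stop polling after a bad wakeup */
        return;
    }
    v->stat_poll_success++;
    /* Assumed policy: double the window, restarting from the base value. */
    v->poll_ns = v->poll_ns ? v->poll_ns * 2 : POLL_BASE_NS;
    if (v->poll_ns > POLL_MAX_NS)
        v->poll_ns = POLL_MAX_NS;
}
```

In this model an IPI or timer wakeup grows the window, while a floating I/O interrupt seen by a vCPU that was not the one woken only increments `stat_no_tuning`.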
  2. 10 May, 2016 1 commit
    • J
      MIPS: KVM: Fix timer IRQ race when writing CP0_Compare · b45bacd2
      James Hogan committed
      Writing CP0_Compare clears the timer interrupt pending bit
      (CP0_Cause.TI), but this wasn't being done atomically. If a timer
      interrupt raced with the write of the guest CP0_Compare, the timer
      interrupt could end up being pending even though the new CP0_Compare is
      nowhere near CP0_Count.
      
      We were already updating the hrtimer expiry with
      kvm_mips_update_hrtimer(), which used both kvm_mips_freeze_hrtimer() and
      kvm_mips_resume_hrtimer(). Close the race window by expanding out
      kvm_mips_update_hrtimer(), and clearing CP0_Cause.TI and setting
      CP0_Compare between the freeze and resume. Since the pending timer
      interrupt should not be cleared when CP0_Compare is written via the KVM
      user API, an ack argument is added to distinguish the source of the
      write.
      
      Fixes: e30492bb ("MIPS: KVM: Rewrite count/compare timer emulation")
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.16.x-
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b45bacd2
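The ordering this commit describes (freeze, clear the pending bit only for guest writes, set the new compare value, resume) can be sketched with a single-threaded model. The struct, its fields, and both functions are hypothetical stand-ins rather than the kernel's code, and real code would also need locking against the timer callback.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a guest timer; names are illustrative only. */
struct timer_model {
    uint32_t compare;    /* models CP0_Compare */
    bool     ti_pending; /* models CP0_Cause.TI */
    bool     frozen;     /* while true, the timer may not raise TI */
};

/* Timer expiry path: can only raise TI while the timer is running. */
static void timer_expire(struct timer_model *t)
{
    if (!t->frozen)
        t->ti_pending = true;
}

/*
 * Write the compare value inside the freeze/resume window.
 * ack = true  models a guest write, which also clears the pending interrupt.
 * ack = false models a write through a user API, which must keep it.
 */
static void write_compare(struct timer_model *t, uint32_t compare, bool ack)
{
    t->frozen = true;          /* freeze: no expiry can slip in from here */
    if (ack)
        t->ti_pending = false; /* acknowledge the old interrupt */
    t->compare = compare;
    t->frozen = false;         /* resume with the new compare value */
}
```

Because the clear and the update both happen while the model is frozen, an expiry cannot land between them and leave a stale pending bit behind.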
  3. 03 Apr, 2016 3 commits
  4. 30 Mar, 2016 1 commit
    • J
      MIPS: cpu_name_string: Use raw_smp_processor_id(). · e95008a1
      James Hogan committed
      If cpu_name_string() is used in non-atomic context when preemption is
      enabled, it can trigger a BUG such as this one:
      
      BUG: using smp_processor_id() in preemptible [00000000] code: unaligned/156
      caller is __show_regs+0x1e4/0x330
      CPU: 2 PID: 156 Comm: unaligned Tainted: G        W       4.3.0-00366-ga3592179816d-dirty #1501
      Stack : ffffffff80900000 ffffffff8019bc18 000000000000005f ffffffff80a20000
               0000000000000000 0000000000000009 ffffffff8019c0e0 ffffffff80835648
               a8000000ff2bdec0 ffffffff80a1e628 000000000000009c 0000000000000002
               ffffffff80840000 a8000000fff2ffb0 0000000000000020 ffffffff8020e43c
               a8000000fff2fcf8 ffffffff80a20000 0000000000000000 ffffffff808f2607
               ffffffff8082b138 ffffffff8019cd1c 0000000000000030 ffffffff8082b138
               0000000000000002 000000000000009c 0000000000000000 0000000000000000
               0000000000000000 a8000000fff2fc40 0000000000000000 ffffffff8044dbf4
               0000000000000000 0000000000000000 0000000000000000 ffffffff8010c400
               ffffffff80855bb0 ffffffff8010d008 0000000000000000 ffffffff8044dbf4
               ...
      Call Trace:
      [<ffffffff8010d008>] show_stack+0x90/0xb0
      [<ffffffff8044dbf4>] dump_stack+0x84/0xe0
      [<ffffffff8046d4ec>] check_preemption_disabled+0x10c/0x110
      [<ffffffff8010c40c>] __show_regs+0x1e4/0x330
      [<ffffffff8010d060>] show_registers+0x28/0xc0
      [<ffffffff80110748>] do_ade+0xcc8/0xce0
      [<ffffffff80105b84>] resume_userspace_check+0x0/0x10
      
      This is possible because cpu_name_string() is used by __show_regs(),
      which is used by both show_regs() and show_registers(). These two
      functions are used by various exception handling functions, only some of
      which ensure that interrupts or preemption is disabled.
      
      However the following have interrupts explicitly enabled or not
      explicitly disabled:
      - do_reserved() (irqs enabled)
      - do_ade() (irqs not disabled)
      
      This can be hit by setting /sys/kernel/debug/mips/unaligned_action to 2,
      and triggering an address error exception, e.g. an unaligned access or
      access to kernel segment from user mode.
      
      To fix the above cases, use raw_smp_processor_id() instead. It is
      unusual for CPU names to be different in the same system, and even if
      they were, it's possible the process has migrated between the exception
      of interest and the cpu_name_string() call anyway.
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/12212/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      e95008a1
  5. 14 Mar, 2016 2 commits
    • A
      ipv6: Pass proto to csum_ipv6_magic as __u8 instead of unsigned short · 1e940829
      Alexander Duyck committed
      This patch updates csum_ipv6_magic so that it correctly recognizes that
      protocol is an unsigned 8-bit value.
      
      This will allow us to better understand what limitations may or may not be
      present in how we handle the data.  For example there are a number of
      places that call htonl on the protocol value.  This is likely not necessary
      and can be replaced with a multiplication by ntohl(1) which will be
      converted to a shift by the compiler.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e940829
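The remark about replacing htonl on the protocol value with a multiplication by ntohl(1) can be checked in user space. The sketch below is not taken from the patch; the two helper names are made up for illustration. It relies on the protocol fitting in 8 bits, so the product cannot overflow 32 bits on either byte order.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Conventional form: byte-swap the widened protocol number. */
static uint32_t proto_be_swap(uint8_t proto)
{
    return htonl((uint32_t)proto);
}

/*
 * Equivalent form for 8-bit inputs: ntohl(1) is 1 on big-endian hosts and
 * 1 << 24 on little-endian hosts, so the multiply places the protocol byte
 * in the same position htonl() would.
 */
static uint32_t proto_be_mul(uint8_t proto)
{
    return (uint32_t)proto * ntohl(1);
}
```

Since ntohl(1) is a compile-time constant power of two on common platforms, a compiler can turn the multiplication into a shift, or into nothing at all on big-endian hosts.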
    • A
      ipv4: Update parameters for csum_tcpudp_magic to their original types · 01cfbad7
      Alexander Duyck committed
      This patch updates all instances of csum_tcpudp_magic and
      csum_tcpudp_nofold to reflect the types that are usually used as the source
      inputs.  For example the protocol field is populated based on nexthdr which
      is actually an unsigned 8 bit value.  The length is usually populated based
      on skb->len which is an unsigned integer.
      
      This addresses an issue in which the IPv6 function csum_ipv6_magic was
      generating a checksum using the full 32b of skb->len while
      csum_tcpudp_magic was only using the lower 16 bits.  As a result we could
      run into issues when attempting to adjust the checksum as there was no
      protocol agnostic way to update it.
      
      With this change the value is still truncated as many architectures use
      "(len + proto) << 8", however this truncation only occurs for values
      greater than 16776960 in length and as such is unlikely to occur as we stop
      the inner headers at ~64K in size.
      
      I did have to make a few minor changes in the arm, mn10300, nios2, and
      score versions of the function in order to support these changes as they
      were either using things such as an OR to combine the protocol and length,
      or were using ntohs to convert the length which would have truncated the
      value.
      
      I also updated a few spots in terms of whitespace and type differences for
      the addresses.  Most of this was just to make sure all of the definitions
      were in sync going forward.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01cfbad7
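The 16776960 limit quoted in this commit follows from 32-bit arithmetic: "(len + proto) << 8" keeps every bit only while len + proto stays below 2^24, and with proto at most 255 that holds for any len up to 2^24 - 256 = 16776960. A minimal check of that arithmetic, using a hypothetical helper name that does not appear in the kernel:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when "(len + proto) << 8" computed in 32 bits keeps every
 * bit of the sum, i.e. when shifting back recovers the original value.
 */
static bool len_proto_shift_is_lossless(uint32_t len, uint8_t proto)
{
    uint32_t sum = len + proto; /* 32-bit sum of length and protocol */
    uint32_t shifted = sum << 8; /* top 8 bits of sum fall off here */

    return (shifted >> 8) == sum;
}
```

For example, a length of 16776960 with protocol 255 is still lossless, while 16776961 with the same protocol is not.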
  6. 08 Mar, 2016 1 commit
  7. 05 Mar, 2016 1 commit
    • D
      mm/pkeys: Fix siginfo ABI breakage caused by new u64 field · 49cd53bf
      Dave Hansen committed
      Stephen Rothwell reported this linux-next build failure:
      
      	http://lkml.kernel.org/r/20160226164406.065a1ffc@canb.auug.org.au
      
      ... caused by the Memory Protection Keys patches from the tip tree triggering
      a newly introduced build-time sanity check on an ARM build, because they changed
      the ABI of siginfo in an unexpected way.
      
      If u64 has a natural alignment of 8 bytes (which is the case on most mainstream
      platforms, with the notable exception of x86-32), then the leadup to the
      _sifields union matters:
      
      typedef struct siginfo {
              int si_signo;
              int si_errno;
              int si_code;
      
              union {
      	...
              } _sifields;
      } __ARCH_SI_ATTRIBUTES siginfo_t;
      
      Note how the first 3 fields give us 12 bytes, so _sifields is not
      naturally 8-byte aligned.
      
      Before the _pkey field addition the largest element of _sifields (on
      32-bit platforms) was 32 bits. With the u64 added, the minimum alignment
      requirement increased to 8 bytes on those (rare) 32-bit platforms. Thus
      GCC padded the space after si_code with 4 extra bytes, and shifted all
      _sifields offsets by 4 bytes - breaking the ABI of all of those
      remaining fields.
      
      On 64-bit platforms this problem was hidden due to _sifields already
      having numerous fields with natural 8-byte alignment (pointers).
      
      To fix this, we replace the u64 with an '__u32'.  The __u32 does not
      increase the minimum alignment requirement of the union, and it is
      also large enough to store the 16-bit pkey we have today on x86.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-next@vger.kernel.org
      Fixes: cd0ea35f ("signals, pkeys: Notify userspace about protection key faults")
      Link: http://lkml.kernel.org/r/20160301125451.02C7426D@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      49cd53bf
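The padding effect described in this commit can be reproduced with two simplified structs. These are illustrative stand-ins, not the real siginfo layout: the union deliberately contains no pointers, so that it behaves like the 32-bit case even when compiled on a 64-bit host, and it assumes a 4-byte int.

```c
#include <stddef.h>
#include <stdint.h>

/* Union whose widest member is 32 bits: needs only 4-byte alignment. */
struct siginfo_u32_demo {
    int si_signo;
    int si_errno;
    int si_code;
    union {
        int      pad[4];
        uint32_t pkey;
    } fields;
};

/* Same layout with a 64-bit member: the union inherits its alignment. */
struct siginfo_u64_demo {
    int si_signo;
    int si_errno;
    int si_code;
    union {
        int      pad[4];
        uint64_t pkey;
    } fields;
};

/* Offset of the union when its widest member is 32 bits. */
static size_t fields_offset_u32(void)
{
    return offsetof(struct siginfo_u32_demo, fields);
}

/* Offset of the union when it contains a 64-bit member. */
static size_t fields_offset_u64(void)
{
    return offsetof(struct siginfo_u64_demo, fields);
}
```

Where uint64_t is 8-byte aligned, the second offset moves from 12 to 16, which is the 4-byte shift the commit message describes. On x86-32, where the in-struct alignment of a 64-bit integer is 4, both offsets stay at 12.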
  8. 26 Feb, 2016 1 commit
    • T
      net: Facility to report route quality of connected sockets · a87cb3e4
      Tom Herbert committed
      This patch adds the SO_CNX_ADVICE socket option (setsockopt only). The
      purpose is to allow an application to give feedback to the kernel about
      the quality of the network path for a connected socket. The value
      argument indicates the type of quality report. For this initial patch
      the only supported advice is a value of 1, which indicates "bad path,
      please reroute". The action taken by the kernel is to call
      dst_negative_advice, which will attempt to choose a different ECMP route,
      reset the TX hash for flow label and UDP source port in encapsulation,
      etc.
      
      This facility should be useful for connected UDP sockets where only the
      application can provide any feedback about path quality. It could also
      be useful for TCP applications that have additional knowledge about the
      path outside of the normal TCP control loop.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a87cb3e4
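From an application's point of view, giving this advice would be a single setsockopt call on an already connected socket. The sketch below is an assumption-laden illustration: the helper name is invented, and the fallback value 53 for SO_CNX_ADVICE is the asm-generic number, which is only correct on architectures that use the generic socket option numbering and is only needed when the installed headers predate the option.

```c
#include <sys/socket.h>

#ifndef SO_CNX_ADVICE
#define SO_CNX_ADVICE 53 /* assumption: asm-generic value, for older headers */
#endif

/*
 * Tell the kernel that the current path of a connected socket looks bad.
 * Returns 0 on success and -1 with errno set on failure, exactly as
 * setsockopt() does.
 */
static int advise_bad_path(int fd)
{
    int advice = 1; /* 1 = "bad path, please reroute" */

    return setsockopt(fd, SOL_SOCKET, SO_CNX_ADVICE, &advice, sizeof(advice));
}
```

An application would typically call this after its own loss or timeout detection fires, for example when a connected UDP socket has seen no replies for several retransmissions.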
  9. 25 Feb, 2016 1 commit
  10. 18 Feb, 2016 1 commit
  11. 17 Feb, 2016 2 commits
  12. 11 Feb, 2016 2 commits
  13. 10 Feb, 2016 1 commit
  14. 06 Feb, 2016 2 commits
  15. 04 Feb, 2016 1 commit
  16. 02 Feb, 2016 2 commits
    • J
      MIPS: Fix FPU disable with preemption · 00fe56dc
      James Hogan committed
      The FPU should not be left enabled after a task context switch. This
      isn't usually a problem as the FPU enable bit is updated before
      returning to userland, however it can potentially mask kernel bugs, and
      in fact KVM assumes it won't happen and won't clear the FPU enable bit
      before returning to the guest, which allows the guest to use stale FPU
      context.
      
      Interrupts and exceptions save and restore most bits of the CP0 Status
      register which contains the FPU enable bit (CU1). When the kernel needs
      to enable or disable the FPU (for example due to attempted FPU use by
      userland, or the scheduler being invoked) both the actual Status
      register and the saved value in the userland context are updated.
      
      However this doesn't work correctly with full kernel preemption enabled,
      since the FPU enable bit can be cleared from within an interrupt when
      the scheduler is invoked, and only the userland context is updated, not
      the interrupt context.
      
      For example:
      1) Enter kernel with FPU already enabled, TIF_USEDFPU=1, Status.CU1=1
         saved.
      2) Take a timer interrupt while in kernel mode, Status.CU1=1 saved.
      3) Timer interrupt invokes scheduler to preempt the task, which clears
         TIF_USEDFPU, disables the FPU in Status register (Status.CU1=0), and
         the value stored in user context from step (1), but not the interrupt
         context from step (2).
      4) When the process is scheduled back in again Status.CU1=0.
      5) The interrupt context from step (2) is restored, which sets
         Status.CU1=1. So from user context point of view, preemption has
         re-enabled FPU!
      6) If the scheduler is invoked again (via preemption or voluntarily)
         before returning to userland, TIF_USEDFPU=0 so the FPU is not
         disabled before the task context switch.
      7) The next task resumes from the context switch with FPU enabled!
      
      The restoring of the Status register on return from interrupt/exception
      is already selective about which bits to restore, leaving the interrupt
      mask bits alone so enabling/disabling of CPU interrupt lines can
      persist. Extend this to also leave both the CU1 bit (FPU enable) and the
      FR bit (which specifies the FPU mode and gets changed with CU1). This
      prevents a stale Status value being restored in step (5) above and
      persisting through subsequent context switches.
      
      Also switch to the use of definitions from asm/mipsregs.h while we're at
      it.
      
      Since this change also affects the restoration of Status register on the
      path back to userland, it increases the sensitivity of the kernel to the
      problem of the FPU being left enabled, allowing it to propagate to
      userland, therefore a warning is also added to lose_fpu_inatomic() to
      point out any future recurrences before they do any damage.
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Reviewed-by: Paul Burton <paul.burton@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/12303/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      00fe56dc
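The selective restore of the Status register described in this commit comes down to a bit-mask merge: bits listed in a "keep" mask are taken from the live register, all others from the value saved at interrupt entry. The helper below is a hypothetical C model of that merge (the kernel does this in assembly), with the mask constants defined locally to match the MIPS CP0 Status bit positions.

```c
#include <stdint.h>

/* CP0 Status bit masks, defined locally for this sketch. */
#define ST0_IM  0x0000ff00u /* interrupt mask bits */
#define ST0_FR  0x04000000u /* FPU register mode */
#define ST0_CU1 0x20000000u /* coprocessor 1 (FPU) enable */

/*
 * Merge a saved Status value with the live one on return from an
 * interrupt: bits in keep_live come from the live register, everything
 * else is restored from the saved copy.
 */
static uint32_t restore_status(uint32_t saved, uint32_t live, uint32_t keep_live)
{
    return (saved & ~keep_live) | (live & keep_live);
}
```

With keep_live set to ST0_IM only, a saved value that still has CU1 set re-enables the FPU after the scheduler cleared it, which is step (5) of the scenario above. Adding ST0_CU1 and ST0_FR to the mask leaves the scheduler's decision in place.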
    • J
      MIPS: Fix buffer overflow in syscall_get_arguments() · f4dce1ff
      James Hogan committed
      Since commit 4c21b8fd ("MIPS: seccomp: Handle indirect system calls
      (o32)"), syscall_get_arguments() attempts to handle o32 indirect syscall
      arguments by incrementing both the start argument number and the number
      of arguments to fetch. However only the start argument number needs to
      be incremented. The number of arguments does not change, they're just
      shifted up by one, and in fact the output array is provided by the
      caller and is likely only n entries long, so reading more arguments
      overflows the output buffer.
      
      In the case of seccomp, this results in it fetching 7 arguments starting
      at the 2nd one, which overflows the unsigned long args[6] in
      populate_seccomp_data(). This clobbers the $s0 register from
      syscall_trace_enter() which __seccomp_phase1_filter() saved onto the
      stack, into which syscall_trace_enter() had placed its syscall number
      argument. This caused Chromium to crash.
      
      Credit goes to Milko for tracking it down as far as $s0 being clobbered.
      
      Fixes: 4c21b8fd ("MIPS: seccomp: Handle indirect system calls (o32)")
      Reported-by: Milko Leporis <milko.leporis@imgtec.com>
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Cc: <stable@vger.kernel.org> # 3.15-
      Patchwork: https://patchwork.linux-mips.org/patch/12213/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      f4dce1ff
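The off-by-one described in this commit is easy to model outside the kernel. In the sketch below, a flat array stands in for the argument registers and stack slots, and the function name and signature are hypothetical. It shows the corrected behaviour: for an indirect syscall only the starting index moves up by one, while the number of copied arguments stays at what the caller's buffer can hold.

```c
#include <stdbool.h>

/*
 * Copy n syscall arguments, beginning at index start, into out.
 * regs models the argument registers and stack slots as one flat array.
 * For an indirect syscall, slot 0 holds the real syscall number, so the
 * actual arguments begin one slot later.
 */
static void get_arguments(const unsigned long *regs, unsigned int start,
                          unsigned int n, unsigned long *out, bool indirect)
{
    if (indirect)
        start++; /* skip the slot carrying the real syscall number */

    /* n is deliberately not incremented: out has room for n entries only. */
    for (unsigned int i = 0; i < n; i++)
        out[i] = regs[start + i];
}
```

Incrementing n as well would make the loop write a seventh value into a six-entry buffer, which is the overflow that clobbered the saved register in the seccomp case.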
  17. 28 Jan, 2016 1 commit
  18. 27 Jan, 2016 1 commit
  19. 24 Jan, 2016 15 commits