1. 16 May 2009, 1 commit
    • x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Jeremy Fitzhardinge authored
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides are: 1) it will replicate the
      preempt_disable/enable code at each lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still, it is a
      reasonable short-term workaround.
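
      As a sketch, making pv-spinlocks a separate config option might look
      roughly like this (the option name and help text are assumptions, not
      quoted from this log):

      	config PARAVIRT_SPINLOCKS
      		bool "Paravirtualization layer for spinlocks"
      		depends on PARAVIRT && SMP
      		help
      		  Paravirtualized spinlocks let a pvops backend replace the
      		  spinlock implementation with something
      		  virtualization-friendly, e.g. blocking the virtual CPU
      		  rather than spinning.  The downside is a small performance
      		  hit on native kernels; if unsure, say N.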
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: "Li Xin" <xin.li@intel.com>
      Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 13 May 2009, 1 commit
  3. 12 May 2009, 1 commit
  4. 11 May 2009, 2 commits
    • x86: fix percpu_{to,from}_op() · 3c598766
      Jan Beulich authored
      - the byte operand constraints were wrong for 32-bit
      - the to-op's input operands weren't properly parenthesized
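
      As a minimal illustration of the parenthesization class of bug (a
      hypothetical macro, not the actual percpu_to_op() code):

      	/* Unparenthesized macro arguments misbind for expressions: */
      	#define SCALE_BAD(x)	x * 2		/* SCALE_BAD(a + b)  -> a + b * 2     */
      	#define SCALE_GOOD(x)	((x) * 2)	/* SCALE_GOOD(a + b) -> ((a + b) * 2) */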
      
      [ Impact: fix possible miscompilation or build failure ]
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86: mtrr: Fix high_width computation when phys-addr is >= 44bit · 917a0153
      Yinghai Lu authored
      Found one system where the CPU address width is 44 bits, and the MTRR
      printout is not right:
      
       [    0.000000] MTRR variable ranges enabled:
       [    0.000000]   0 base 0   00000000 mask FF0 00000000 write-back
       [    0.000000]   1 base 10  00000000 mask FFF 80000000 write-back
       [    0.000000]   2 base 0   80000000 mask FFF 80000000 uncachable
       [    0.000000]   3 base 0   7F800000 mask FFF FF800000 uncachable
      
      Li Zefan and Frederic pointed out that high_width could somehow end up
      being -4.

      It turns out that when phys_addr is 44 bits, size_or_mask will be
      ffffffff,00000000, so ffs(size_or_mask) will be 0.

      Check the low 32 bits first to get the correct high_width.
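
      A hedged sketch of the check described above (the helper name and
      exact arithmetic are assumptions):

      	/*
      	 * ffs() takes an int, so a 64-bit mask whose low 32 bits are all
      	 * zero yields ffs() == 0.  Probe the low word first, then fall
      	 * back to the high word.
      	 */
      	static int mask_lowest_bit(u64 size_or_mask)
      	{
      		if (size_or_mask & 0xffffffffUL)
      			return ffs((u32)size_or_mask) - 1;
      		return ffs((u32)(size_or_mask >> 32)) + 32 - 1;
      	}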
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Also-analyzed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Also-analyzed-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Zhaolei <zhaolei@cn.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4A026540.8060504@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 10 May 2009, 1 commit
  6. 09 May 2009, 8 commits
  7. 08 May 2009, 13 commits
    • mtd: fix timeout in M25P80 driver · cd1a6de7
      Peter Horton authored
      Extend the erase timeout in the M25P80 SPI flash driver.

      The M25P80 driver fails to erase sectors on an M25P128 because the
      ready-wait timeout is too short. Change the timeout from a simple loop
      count to a suitable number of seconds.
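
      The shape of the fix, as a hedged sketch (MAX_READY_WAIT_JIFFIES,
      read_sr() and SR_WIP are assumed from context):

      	/* Bound the ready-wait by wall-clock time, not a retry count. */
      	unsigned long deadline = jiffies + MAX_READY_WAIT_JIFFIES;

      	do {
      		int sr = read_sr(flash);	/* read status register */

      		if (sr >= 0 && !(sr & SR_WIP))	/* write-in-progress done */
      			return 0;
      		cond_resched();
      	} while (!time_after(jiffies, deadline));

      	return -ETIMEDOUT;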
      Signed-off-by: Peter Horton <zero@colonel-panic.org>
      Tested-by: Martin Michlmayr <tbm@cyrius.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
    • x86: MCE: make cmci_discover_lock irq-safe · e5299926
      Hidetoshi Seto authored
      Lockdep reports the warning below when Li tries to offline one CPU:
      
      [  110.835487] =================================
      [  110.835616] [ INFO: inconsistent lock state ]
      [  110.835688] 2.6.30-rc4-00336-g8c9ed899 #52
      [  110.835757] ---------------------------------
      [  110.835828] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
      [  110.835908] swapper/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
      [  110.835982]  (cmci_discover_lock){?.+...}, at: [<ffffffff80236dc0>] cmci_clear+0x30/0x9b
      
      cmci_clear() can be called via smp_call_function_single().
      
      It is better to disable interrupts while holding cmci_discover_lock,
      turning it into an irq-safe lock; otherwise we can deadlock.
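
      A minimal sketch of the irq-safe locking pattern the fix applies:

      	unsigned long flags;

      	/*
      	 * Disabling interrupts while holding the lock makes it safe
      	 * against re-entry from interrupt context, e.g. when cmci_clear()
      	 * runs via smp_call_function_single().
      	 */
      	spin_lock_irqsave(&cmci_discover_lock, flags);
      	/* ... walk/clear the CMCI banks ... */
      	spin_unlock_irqrestore(&cmci_discover_lock, flags);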
      
      [ Impact: fix possible deadlock in the MCE code ]
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4A03ED38.8000700@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: xen, i386: reserve Xen pagetables · 33df4db0
      Jeremy Fitzhardinge authored
      The Xen pagetables are no longer implicitly reserved as part of the other
      i386_start_kernel reservations, so make sure we explicitly reserve them.
      This prevents them from being released into the general kernel free page
      pool and reused.
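
      A hedged sketch of such an explicit reservation (the reserve_early()
      call shape is an assumption based on that era's API):

      	/* Keep the pagetable pages Xen handed us out of the free pool. */
      	reserve_early(__pa(xen_start_info->pt_base),
      		      __pa(xen_start_info->pt_base +
      			   xen_start_info->nr_pt_frames * PAGE_SIZE),
      		      "XEN PAGETABLES");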
      
      [ Impact: fix Xen guest crash ]
      Also-Bisected-by: Bryan Donlan <bdonlan@gmail.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <4A032EEC.30509@goop.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86, kexec: fix crashdump panic with CONFIG_KEXEC_JUMP · 6407df5c
      Huang Ying authored
      Tim Starling reported that crashdump will panic with a kernel compiled
      with CONFIG_KEXEC_JUMP, due to a null pointer dereference in
      machine_kexec_32.c:machine_kexec() when dereferencing
      kexec_image. Referring to:
      
      http://bugzilla.kernel.org/show_bug.cgi?id=13265
      
      This patch fixes the bug by replacing the global variable reference
      kexec_image in machine_kexec() with the local parameter image, which
      is more appropriate and will not be null.

      The same bug exists in machine_kexec_64.c, and is fixed there in the
      same way.
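
      The shape of the fix, sketched (surrounding code elided):

      	void machine_kexec(struct kimage *image)
      	{
      	#ifdef CONFIG_KEXEC_JUMP
      		/*
      		 * Was: if (kexec_image->preserve_context) -- the global
      		 * kexec_image is NULL in the crash-dump path, while the
      		 * image argument is always valid here.
      		 */
      		if (image->preserve_context)
      			save_processor_state();
      	#endif
      		/* ... rest of machine_kexec() unchanged ... */
      	}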
      
      [ Impact: fix crash on kexec ]
      Reported-by: Tim Starling <tstarling@wikimedia.org>
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      LKML-Reference: <1241751101.6259.85.camel@yhuang-dev.sh.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86-64: finish cleanup_highmaps()'s job wrt. _brk_end · 49834396
      Jan Beulich authored
      With the introduction of the .brk section, special care must be taken
      that no unused page table entries remain if _brk_end and _end are
      separated by a 2M page boundary. cleanup_highmap() runs very early and
      hence cannot take care of that, so any entries past _brk_end that need
      removing must be cleared once the brk allocator has done its job.
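
      A hedged sketch of that late cleanup (exact placement and the
      page-table walking helpers are assumptions):

      	/*
      	 * Clear kernel-mapping PMD entries between the 2M page holding
      	 * _brk_end and the one holding _end, which the early
      	 * cleanup_highmap() pass could not know about.
      	 */
      	pud_t *pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
      	pmd_t *pmd = pmd_offset(pud, _brk_end - 1);

      	while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
      		pmd_clear(pmd);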
      
      [ Impact: avoids undesirable TLB aliases ]
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86: fix boot hang in early_reserve_e820() · 61438766
      Jan Beulich authored
      If the first non-reserved (sub-)range doesn't fit the size requested,
      an endless loop is entered: when a range returned from
      find_e820_area_size() turns out to be too small, that range must be
      skipped before calling the function again.
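
      The shape of the fix, sketched (find_e820_area_size() semantics are
      assumed from the text above):

      	/* Advance past an undersized range instead of re-probing it. */
      	for (start = startt; ; start += size) {
      		mem = find_e820_area_size(start, &size, align);
      		if (mem >= ULLONG_MAX)		/* nothing left to try */
      			return 0;
      		if (size >= sizet)		/* big enough: use it */
      			break;
      	}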
      
      [ Impact: fixes boot hang on some platforms ]
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 · d7a59269
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (32 commits)
        [CIFS] Fix double list addition in cifs posix open code
        [CIFS] Allow raw ntlmssp code to be enabled with sec=ntlmssp
        [CIFS] Fix SMB uid in NTLMSSP authenticate request
        [CIFS] NTLMSSP reenabled after move from connect.c to sess.c
        [CIFS] Remove sparse warning
        [CIFS] remove checkpatch warning
        [CIFS] Fix final user of old string conversion code
        [CIFS] remove cifs_strfromUCS_le
        [CIFS] NTLMSSP support moving into new file, old dead code removed
        [CIFS] Fix endian conversion of vcnum field
        [CIFS] Remove trailing whitespace
        [CIFS] Remove sparse endian warnings
        [CIFS] Add remaining ntlmssp flags and standardize field names
        [CIFS] Fix build warning
        cifs: fix length handling in cifs_get_name_from_search_buf
        [CIFS] Remove unneeded QuerySymlink call and fix mapping for unmapped status
        [CIFS] rename cifs_strndup to cifs_strndup_from_ucs
        Added loop check when mounting DFS tree.
        Enable dfs submounts to handle remote referrals.
        [CIFS] Remove older session setup implementation
        ...
    • [CIFS] Fix double list addition in cifs posix open code · 90e4ee5d
      Steve French authored
      Stop adding the open file entry to the lists twice, and do not fill in
      the file info twice in the case of posix opens and creates.
      Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
      Signed-off-by: Steve French <sfrench@us.ibm.com>
    • NOMMU: Don't check vm_region::vm_start is page aligned in add_nommu_region() · 8c9ed899
      David Howells authored
      Don't check that vm_region::vm_start is page aligned in
      add_nommu_region(), because the region may reflect a non-page-aligned
      mapped file, such as one obtained from RomFS XIP.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'for-linus' of git://neil.brown.name/md · ee7fee0b
      Linus Torvalds authored
      * 'for-linus' of git://neil.brown.name/md:
        md: remove rd%d links immediately after stopping an array.
        md: remove ability to explicit set an inactive array to 'clean'.
        md: constify VFTs
        md: tidy up status_resync to handle large arrays.
        md: fix some (more) errors with bitmaps on devices larger than 2TB.
        md/raid10: don't clear bitmap during recovery if array will still be degraded.
        md: fix loading of out-of-date bitmap.
    • random: make get_random_int() more random · 8a0a9bd4
      Linus Torvalds authored
      It's a really simple patch that basically just open-codes the current
      "secure_ip_id()" call, but when open-coding it we now use a _static_
      hashing area, so that it gets updated every time.
      
      And to make sure somebody can't just start from the same original seed
      of all-zeroes, and then do the "half_md4_transform()" over and over
      until they get the same sequence as the kernel has, each iteration also
      mixes in the same old "current->pid + jiffies" we used - so we should
      now have a regular strong pseudo-random number generator, but also one
      that doesn't have a single seed.
      
      Note: the "pid + jiffies" is just meant to be a tiny tiny bit of noise. It
      has no real meaning. It could be anything. I just picked the previous
      seed, it's just that now we keep the state in between calls and that will
      feed into the next result, and that should make all the difference.
      
      I made that hash per-cpu data just to avoid cache-line ping-pong:
      having multiple CPUs write to the same data would be fine for
      randomness, and add yet another layer of chaos to it, but since
      get_random_int() is supposed to be a fast interface I did it that way
      instead.  I considered using "__raw_get_cpu_var()" to avoid any
      preemption overhead while still getting the hash mostly ping-pong
      free, but in the end good taste won out.
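
      A sketch of the scheme as described (the per-cpu hash width and helper
      names are assumptions drawn from the text):

      	static DEFINE_PER_CPU(__u32 [4], get_random_int_hash);

      	unsigned int get_random_int(void)
      	{
      		struct keydata *keyptr = get_keyptr();
      		__u32 *hash = get_cpu_var(get_random_int_hash);
      		unsigned int ret;

      		/* Static per-cpu state carries over between calls;
      		 * pid + jiffies is just a little extra noise. */
      		hash[0] += current->pid + jiffies + get_cycles();

      		ret = half_md4_transform(hash, keyptr->secret);
      		put_cpu_var(get_random_int_hash);

      		return ret;
      	}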
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge master.kernel.org:/home/rmk/linux-2.6-arm · 2c66fa7e
      Linus Torvalds authored
      * master.kernel.org:/home/rmk/linux-2.6-arm:
        [ARM] 5507/1: support R_ARM_MOVW_ABS_NC and MOVT_ABS relocation types
        [ARM] 5506/1: davinci: DMA_32BIT_MASK --> DMA_BIT_MASK(32)
        i.MX31: Disable CPU_32v6K in mx3_defconfig.
        mx3fb: Fix compilation with CONFIG_PM
        mx27ads: move PBC mapping out of vmalloc space
        MXC: remove BUG_ON in interrupt handler
        mx31: remove mx31moboard_defconfig
        ARM: ARCH_MXC should select HAVE_CLK
        mxc : BUG in imx_dma_request
        mxc : Clean up properly when imx_dma_free() used without imx_dma_disable()
        [ARM] mv78xx0: update defconfig
        [ARM] orion5x: update defconfig
        [ARM] Kirkwood: update defconfig
        [ARM] Kconfig typo fix:  "PXA930" -> "CPU_PXA930".
        [ARM] S3C2412: Add missing cache flush in suspend code
        [ARM] S3C: Add UDIVSLOT support for newer UARTS
        [ARM] S3C64XX: Add S3C64XX_PA_IIS{0,1} to <mach/map.h>
    • [ARM] 5507/1: support R_ARM_MOVW_ABS_NC and MOVT_ABS relocation types · ae51e609
      Paul Gortmaker authored
      From: Bruce Ashfield <bruce.ashfield@windriver.com>
      
      To fully support the armv7-a instruction set/optimizations, support
      for the R_ARM_MOVW_ABS_NC and R_ARM_MOVT_ABS relocation types is
      required.
      
      MOVW and MOVT are both load-immediate instructions: MOVW loads 16 bits
      into the bottom half of a register, and MOVT loads 16 bits into the
      top half.
      
      The relocation information for these instructions carries a full
      32-bit value, plus an addend stored in the 16 immediate bits of the
      instruction itself.  The immediate bits in the instruction are not
      contiguous (the register number splits them into a 4-bit and a 12-bit
      field), so the addend has to be extracted accordingly and added to the
      value.  The value is then split and put back into the instruction; a
      MOVW uses the bottom 16 bits of the value, and a MOVT uses the top 16
      bits.
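
      A hedged sketch of the module-loader handling (imm4:imm12 bit layout
      per the ARM ARM; variable names are assumptions):

      	case R_ARM_MOVW_ABS_NC:
      	case R_ARM_MOVT_ABS:
      		offset = *(u32 *)loc;
      		/* Gather the split addend: imm4 (bits 19:16) : imm12. */
      		offset = ((offset & 0xf0000) >> 4) | (offset & 0xfff);
      		offset = (offset ^ 0x8000) - 0x8000;	/* sign-extend */

      		offset += sym->st_value;
      		if (ELF32_R_TYPE(rel->r_info) == R_ARM_MOVT_ABS)
      			offset >>= 16;		/* MOVT takes the top half */

      		/* Scatter the 16-bit result back into imm4:imm12. */
      		*(u32 *)loc &= 0xfff0f000;
      		*(u32 *)loc |= ((offset & 0xf000) << 4) | (offset & 0x0fff);
      		break;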
      Signed-off-by: David Borman <david.borman@windriver.com>
      Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
  8. 07 May 2009, 13 commits