1. 21 Jan, 2012 · 1 commit
    • x86: Adjust asm constraints in atomic64 wrappers · 819165fb
      Committed by Jan Beulich
      Eric pointed out overly restrictive constraints in atomic64_set(), but
      there are issues throughout the file. In the cited case, %ebx and %ecx
      are inputs only (they don't get changed by either of the two low-level
      implementations). The same was true elsewhere.
      
      Further, in many cases early-clobber indicators were missing.
      
      Finally, the previous implementation rolled a custom alternative-
      instruction macro from scratch rather than using alternative_call()
      (which was introduced by the very commit that the description of the
      change in question refers to). Switching over has the benefit of not
      hiding the referenced symbols from the compiler; it does, however,
      require them to be declared in more than just the exporting source
      file (which, as a desirable side effect, allows that exporting file
      to become a real 5-line stub).
      
      This patch does not eliminate the overly restrictive memory clobbers,
      however: doing so would occasionally make the compiler set up a second
      register for accessing the memory object (to satisfy the added "m"
      constraint), and it is not clear which of the two non-optimal
      alternatives is better. (A minimal constraint sketch follows this
      entry.)
      
      v2: Re-do the declaration and exporting of the internal symbols.
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Link: http://lkml.kernel.org/r/4F19A2A5020000780006E0D9@nat28.tlf.novell.com
      Cc: Luca Barbieri <luca@luca-barbieri.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      819165fb
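      To make the constraint points concrete, here is a minimal sketch of an
      atomic64 store for 32-bit x86 (illustrative only; my_atomic64_t and
      my_atomic64_set are hypothetical names, not the kernel's code). The new
      value sits in %ebx:%ecx as plain "b"/"c" inputs, since cmpxchg8b never
      modifies them, while %edx:%eax is read-write ("+A") because cmpxchg8b
      reloads it on failure; the broad "memory" clobber mirrors the one the
      commit deliberately keeps:

          #include <stdint.h>

          typedef struct { int64_t counter; } my_atomic64_t;

          static inline void my_atomic64_set(my_atomic64_t *v, int64_t n)
          {
                  int64_t old = v->counter;

                  /* Loop until cmpxchg8b succeeds: on failure it reloads
                   * %edx:%eax ("old") from memory, hence the "+A"
                   * constraint; the new value in %ebx:%ecx is input-only. */
                  asm volatile("1:     lock; cmpxchg8b %0\n\t"
                               "       jne 1b"
                               : "+m" (v->counter), "+A" (old)
                               : "b" ((uint32_t)n), "c" ((uint32_t)(n >> 32))
                               : "memory");
          }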
  2. 18 Jan, 2012 · 1 commit
  3. 16 Jan, 2012 · 1 commit
  4. 13 Dec, 2011 · 1 commit
  5. 05 Dec, 2011 · 2 commits
  6. 10 Oct, 2011 · 1 commit
  7. 27 Jul, 2011 · 1 commit
  8. 22 Jul, 2011 · 1 commit
  9. 21 Jul, 2011 · 3 commits
  10. 14 Jul, 2011 · 1 commit
  11. 04 Jun, 2011 · 1 commit
  12. 18 May, 2011 · 6 commits
  13. 02 May, 2011 · 1 commit
  14. 28 Mar, 2011 · 1 commit
  15. 18 Mar, 2011 · 2 commits
  16. 02 Mar, 2011 · 1 commit
  17. 01 Mar, 2011 · 3 commits
  18. 28 Feb, 2011 · 1 commit
  19. 26 Jan, 2011 · 1 commit
  20. 04 Jan, 2011 · 1 commit
  21. 25 Sep, 2010 · 1 commit
    • x86, mem: Optimize memmove for small size and unaligned cases · 3b4b682b
      Committed by Ma Ling
      The movs instruction combines data accesses to accelerate moving data;
      however, two cases need attention (see the dispatch sketch after this
      entry):
      
      1. movs needs a long latency to start up, so for small sizes we use
         general mov instructions to copy the data.
      2. movs is not good for unaligned cases: even if the source offset is
         0x10 and the destination offset is 0x0, we avoid movs and handle
         the case with general mov instructions.
      Signed-off-by: Ma Ling <ling.ma@intel.com>
      LKML-Reference: <1284664360-6138-1-git-send-email-ling.ma@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      3b4b682b
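      A rough C rendering of that dispatch (a sketch under assumptions: the
      64-byte threshold and the relative-alignment test are invented for
      illustration, and copy_forward_sketch is not a kernel function):

          #include <stddef.h>

          static void *copy_forward_sketch(void *dst, const void *src, size_t n)
          {
                  char *d = dst;
                  const char *s = src;

                  /* Small or relatively misaligned copies take the general
                   * mov path: rep movs pays a long startup latency and
                   * handles mismatched src/dst offsets poorly. */
                  if (n < 64 || (((size_t)d ^ (size_t)s) & 0xf)) {
                          while (n--)
                                  *d++ = *s++;
                  } else {
                          /* Aligned bulk path: let rep movs combine data. */
                          asm volatile("rep movsb"
                                       : "+D" (d), "+S" (s), "+c" (n)
                                       : : "memory");
                  }
                  return dst;
          }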
  22. 24 Aug, 2010 · 2 commits
    • x86, mem: Optimize memcpy by avoiding memory false dependence · 59daa706
      Committed by Ma Ling
      All read operations after the allocation stage can run speculatively,
      while all write operations run in program order; a read may run before
      an older write if their addresses differ, otherwise it must wait until
      the write commits. However, the CPU does not compare every address
      bit, so a read can fail to recognize a different address even when the
      two are in different pages. For example, if %rsi is 0xf004 and %rdi is
      0xe008, the following sequence incurs a large performance penalty:
      
      1. movq (%rsi),  %rax
      2. movq %rax,   (%rdi)
      3. movq 8(%rsi), %rax
      4. movq %rax,   8(%rdi)
      
      If %rsi and %rdi really were in the same memory page, there would be a
      true read-after-write dependence, because instruction 2 writes offset
      0x008 and instruction 3 reads offset 0x00c; the two accesses partially
      overlap. Here they are actually in different pages and there is no
      real issue, but because it does not check every address bit the CPU
      may assume they are in the same page, so instruction 3 has to wait for
      instruction 2 to write its data from the write buffer into the cache
      and then load the data from the cache; the time the read spends is
      comparable to an mfence instruction. We can avoid this by reordering
      the sequence as follows:
      
      1. movq 8(%rsi), %rax
      2. movq %rax,   8(%rdi)
      3. movq (%rsi),  %rax
      4. movq %rax,   (%rdi)
      
      Now instruction 3 reads offset 0x004 while instruction 2 writes offset
      0x010, so there is no dependence at all. On Core2 this gains a 1.83x
      speedup over the original instruction sequence. In this patch we first
      handle small sizes (less than 20 bytes), then jump to the appropriate
      copy mode. In our micro-benchmark we measured up to a 2x improvement
      for small sizes from 1 to 127 bytes, and up to a 1.5x improvement for
      1024 bytes on Core i7. (These numbers come from our micro-benchmark;
      we will do further testing according to your requirements.) A
      reordered-copy sketch follows this entry.
      Signed-off-by: Ma Ling <ling.ma@intel.com>
      LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      59daa706
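      The same reordering idea can be sketched in C: within each 16-byte
      chunk, touch the higher-addressed quadword first so that consecutive
      store/load pairs never share low address bits (copy16_reordered is an
      illustrative name, and the casts assume the relaxed aliasing typical
      of low-level copy code; this is not the kernel's actual memcpy):

          #include <stddef.h>
          #include <stdint.h>

          static void *copy16_reordered(void *dst, const void *src, size_t n)
          {
                  char *d = dst;
                  const char *s = src;

                  while (n >= 16) {
                          /* High quadword first, low quadword second: the
                           * following load's low address bits differ from
                           * the preceding store's, so the CPU cannot infer
                           * a false read-after-write dependence. */
                          uint64_t hi = *(const uint64_t *)(s + 8);
                          *(uint64_t *)(d + 8) = hi;
                          uint64_t lo = *(const uint64_t *)s;
                          *(uint64_t *)d = lo;
                          s += 16;
                          d += 16;
                          n -= 16;
                  }
                  while (n--)     /* byte tail for small remainders */
                          *d++ = *s++;
                  return dst;
          }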
    • x86, mem: Don't implement forward memmove() as memcpy() · fdf42896
      Committed by Ma, Ling
      memmove() allows the source and destination addresses to overlap, but
      memcpy() has no such requirement. Therefore, explicitly implement
      memmove() in both the forward and backward directions, to give us the
      freedom to optimize memcpy(). (A direction-choice sketch follows this
      entry.)
      Signed-off-by: Ma Ling <ling.ma@intel.com>
      LKML-Reference: <C10D3FB0CD45994C8A51FEC1227CE22F0E483AD86A@shsmsx502.ccr.corp.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      fdf42896
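      For reference, the direction choice reads roughly like this in C (a
      minimal textbook sketch named memmove_sketch, not the optimized
      assembly the patch adds):

          #include <stddef.h>

          static void *memmove_sketch(void *dst, const void *src, size_t n)
          {
                  char *d = dst;
                  const char *s = src;

                  if (d < s || d >= s + n) {
                          /* No harmful overlap: a forward copy is safe. */
                          while (n--)
                                  *d++ = *s++;
                  } else {
                          /* dst overlaps the tail of src: copy backward so
                           * each byte is read before it is overwritten. */
                          d += n;
                          s += n;
                          while (n--)
                                  *--d = *--s;
                  }
                  return dst;
          }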
  23. 12 Aug, 2010 · 2 commits
  24. 29 Jul, 2010 · 2 commits
  25. 14 Jul, 2010 · 1 commit
  26. 08 Jul, 2010 · 1 commit
    • x86, alternatives: Use 16-bit numbers for cpufeature index · 83a7a2ad
      Committed by H. Peter Anvin
      We already have cpufeature indices above 255, so use a 16-bit number
      for the alternatives index.  This consumes a padding field and so
      doesn't add any size, but it means that abusing the padding field to
      create assembly errors on overflow no longer works.  We can retain the
      test simply by redirecting it to the .discard section, however. (A
      layout sketch follows this entry.)
      
      [ v3: updated to include open-coded locations ]
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      LKML-Reference: <tip-f88731e3068f9d1392ba71cc9f50f035d26a0d4f@git.kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      83a7a2ad
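      A rough sketch of the record layout the change implies (field names
      are illustrative and may not match the kernel's struct alt_instr
      exactly): widening the feature index from 8 to 16 bits consumes the
      adjacent padding byte, so the struct size is unchanged:

          #include <stdint.h>

          struct alt_instr_sketch {
                  int32_t  instr_offset;    /* original instruction      */
                  int32_t  repl_offset;     /* replacement instruction   */
                  uint16_t cpuid;           /* feature index: formerly a
                                               u8 plus one padding byte */
                  uint8_t  instrlen;        /* length of original        */
                  uint8_t  replacementlen;  /* length of replacement     */
          };

      The overflow check that previously abused the padding byte now emits
      its assertion into the .discard section, where it still fails the
      build on an out-of-range index but occupies no space in the image.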