1. 06 Nov, 2012 1 commit
    • tcg/ppc32: Use trampolines to trim the code size for mmu slow path accessors · c878da3b
      malc committed
      mmu access looks something like:
      
      <check tlb>
      if miss goto slow_path
      <fast path>
      done:
      ...
      
      ; end of the TB
      slow_path:
       <pre process>
       mr r3, r27         ; move areg0 to r3
                          ; (r3 holds the first argument for all the PPC32 ABIs)
       <call mmu_helper>
       b $+8
       .long done
       <post process>
       b done
      
      On ppc32 <call mmu_helper> is:
      
      (SysV and Darwin)
      
      mmu_helper is most likely not within direct branching distance from
      the call site, necessitating
      
      a. moving 32 bit offset of mmu_helper into a GPR ; 8 bytes
      b. moving GPR to CTR/LR                          ; 4 bytes
      c. (finally) branching to CTR/LR                 ; 4 bytes
      
      r3 setting              - 4 bytes
      call                    - 16 bytes
      dummy jump over retaddr - 4 bytes
      embedded retaddr        - 4 bytes
               Total overhead - 28 bytes
      
      (PowerOpen (AIX))
      a. moving 32 bit offset of mmu_helper's TOC into GPR1   ; 8 bytes
      b. loading 32 bit function pointer into GPR2            ; 4 bytes
      c. moving GPR2 to CTR/LR                                ; 4 bytes
      d. loading 32 bit small area pointer into R2            ; 4 bytes
      e. (finally) branching to CTR/LR                        ; 4 bytes
      
      r3 setting              - 4 bytes
      call                    - 24 bytes
      dummy jump over retaddr - 4 bytes
      embedded retaddr        - 4 bytes
               Total overhead - 36 bytes
      
      The following is done to trim the code size of slow path sections:
      
      In tcg_target_qemu_prologue trampolines are emitted that look like this:
      
      trampoline:
      mfspr r3, LR
      addi  r3, r3, 4
      mtspr LR, r3      ; fixup LR to point over embedded retaddr
      mr    r3, r27
      <jump mmu_helper> ; tail call of sorts
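
      As a rough illustration, the four fixed instructions of the trampoline
      can be encoded by hand as below. The helper names (ppc_mfspr,
      emit_trampoline_prefix, ...) are hypothetical and are not the emitter
      functions used in tcg/ppc; the final <jump mmu_helper> is left out
      because its encoding depends on the ABI and on the helper's address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SPR_LR = 8 };  /* SPR number of the link register */

/* The 10-bit SPR number is stored with its two 5-bit halves swapped. */
static uint32_t ppc_spr_field(unsigned spr)
{
    assert(spr < 1024);
    return ((spr & 0x1Fu) << 16) | ((spr >> 5) << 11);
}

/* mfspr rd, spr (primary opcode 31, extended opcode 339) */
static uint32_t ppc_mfspr(unsigned rd, unsigned spr)
{
    assert(rd < 32);
    return 0x7C0002A6u | ((uint32_t)rd << 21) | ppc_spr_field(spr);
}

/* mtspr spr, rs (primary opcode 31, extended opcode 467) */
static uint32_t ppc_mtspr(unsigned spr, unsigned rs)
{
    assert(rs < 32);
    return 0x7C0003A6u | ((uint32_t)rs << 21) | ppc_spr_field(spr);
}

/* addi rd, ra, simm (primary opcode 14) */
static uint32_t ppc_addi(unsigned rd, unsigned ra, int16_t simm)
{
    assert(rd < 32 && ra < 32);
    return 0x38000000u | ((uint32_t)rd << 21) | ((uint32_t)ra << 16)
           | (uint16_t)simm;
}

/* mr ra, rs is the simplified mnemonic for: or ra, rs, rs */
static uint32_t ppc_mr(unsigned ra, unsigned rs)
{
    assert(ra < 32 && rs < 32);
    return 0x7C000378u | ((uint32_t)rs << 21) | ((uint32_t)ra << 16)
           | ((uint32_t)rs << 11);
}

/* Writes the LR fixup and the areg0 move; returns the number of words. */
static size_t emit_trampoline_prefix(uint32_t out[4])
{
    out[0] = ppc_mfspr(3, SPR_LR);  /* mfspr r3, LR                      */
    out[1] = ppc_addi(3, 3, 4);     /* addi  r3, r3, 4 (skip retaddr)    */
    out[2] = ppc_mtspr(SPR_LR, 3);  /* mtspr LR, r3                      */
    out[3] = ppc_mr(3, 27);         /* mr    r3, r27 (areg0 -> 1st arg)  */
    return 4;
}
```

      The prefix is 16 bytes, but it is emitted once per helper in the
      prologue rather than once per slow path.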
      
      And slow path becomes:
      
      slow_path:
       <pre process>
       <call trampoline>
       .long done
       <post process>
       b done
      
      call                    - 4 bytes (trampoline is within code gen buffer
                                         and most likely accessible via
                                         direct branch)
      embedded retaddr        - 4 bytes
               Total overhead - 8 bytes
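
      The 4 byte call relies on the trampoline being reachable with a single
      relative branch-and-link. A minimal sketch of that range check and of
      the bl encoding follows; the function names are hypothetical, and
      address wrap-around at the top of the 32 bit address space is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A relative "bl" carries a signed 26-bit byte displacement whose low two
 * bits are always zero, so it reaches -0x2000000 .. +0x1FFFFFC bytes from
 * the branch instruction itself.
 */
static bool ppc_direct_branch_ok(uint32_t from, uint32_t to)
{
    int64_t disp = (int64_t)to - (int64_t)from;

    return (disp & 3) == 0 && disp >= -0x2000000LL && disp <= 0x1FFFFFCLL;
}

/* Encode "bl to" placed at address "from" (primary opcode 18, LK = 1). */
static uint32_t ppc_encode_bl(uint32_t from, uint32_t to)
{
    assert(ppc_direct_branch_ok(from, to));
    /* Unsigned subtraction wraps, giving the two's complement displacement. */
    return 0x48000001u | ((to - from) & 0x03FFFFFCu);
}
```

      When the check fails, the longer load-address / mtctr / bctrl style
      sequence described for mmu_helper is needed instead.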
      
      In the end the icache pressure is decreased by 20 bytes (SysV and
      Darwin, 28 -> 8) or 28 bytes (PowerOpen, 36 -> 8) per slow path, at
      the cost of an extra jump to the trampoline and adjusting LR (to skip
      over the embedded retaddr) once inside.
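
      The byte accounting above can be re-checked with a small table. The
      struct and names below are illustrative only; the figures are the ones
      listed in this message.

```c
#include <assert.h>

/* Per-slow-path overhead, in bytes, split the same way as in the text. */
struct slow_path_cost {
    int set_r3;     /* mr r3, r27 at the call site          */
    int call;       /* instructions needed to reach callee  */
    int jump_over;  /* dummy jump over the embedded retaddr */
    int retaddr;    /* embedded return address (.long done) */
};

static const struct slow_path_cost SYSV_DARWIN = { 4, 16, 4, 4 };
static const struct slow_path_cost POWEROPEN   = { 4, 24, 4, 4 };
/* With trampolines the r3 move and the retaddr skip move out of line. */
static const struct slow_path_cost TRAMPOLINE  = { 0,  4, 0, 4 };

static int total_bytes(struct slow_path_cost c)
{
    assert(c.set_r3 >= 0 && c.call >= 0 && c.jump_over >= 0 && c.retaddr >= 0);
    return c.set_r3 + c.call + c.jump_over + c.retaddr;
}

/* Bytes saved per slow path when switching from "before" to "after". */
static int bytes_saved(struct slow_path_cost before, struct slow_path_cost after)
{
    return total_bytes(before) - total_bytes(after);
}
```

      bytes_saved(SYSV_DARWIN, TRAMPOLINE) and
      bytes_saved(POWEROPEN, TRAMPOLINE) give the two savings figures
      quoted above.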
      Signed-off-by: malc <av1474@comtv.ru>
  2. 05 Nov, 2012 2 commits
  3. 04 Nov, 2012 1 commit
  4. 03 Nov, 2012 33 commits
  5. 02 Nov, 2012 3 commits