1. 25 Mar, 2020 6 commits
  2. 22 Feb, 2020 2 commits
    • S
      math: fix sinh overflows in non-nearest rounding · d2055814
      Szabolcs Nagy committed
      The final rounding operation should be done with the correct sign
      otherwise huge results may incorrectly get rounded to or away from
      infinity in upward or downward rounding modes.
      
      This affected sinh and sinhf which set the sign on the result after
      a potentially overflowing mul. There may be other non-nearest rounding
      issues, but this was a known long standing issue with large ulp error
      (depending on how ulp is defined near infinity).
      
      The fix should have no effect on sinh and sinhf performance but may
      have a tiny effect on cosh and coshf.
      d2055814
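A minimal sketch of the ordering issue described in the sinh message above. The helper names and the `0x1p1021` scale are illustrative assumptions, not code from the commit. Folding the sign into the first factor means the last multiplication, which may overflow, already carries the final sign, so a directed rounding mode rounds the overflow in the intended direction.

```c
#include <math.h>

/* Hypothetical helper: computes sign * t * scale * scale.
 * The sign is applied before the final multiplication, so if that
 * multiplication overflows, it overflows with the final sign. */
double scaled_with_sign(double t, double sign)
{
    const double scale = 0x1p1021; /* illustrative value */
    return t * (sign * scale) * scale;
}

/* The problematic ordering, for contrast: the product may overflow as a
 * positive value and only afterwards has its sign flipped. */
double scaled_then_signed(double t, double sign)
{
    const double scale = 0x1p1021;
    return sign * (t * scale * scale);
}
```

In round-to-nearest both helpers overflow to the same infinity. Under an upward rounding mode with `sign == -1.0`, the first would be expected to give the largest-magnitude finite negative value, while the second overflows to positive infinity first and then returns negative infinity.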
    • S
      math: fix __rem_pio2 in non-nearest rounding modes · b3797d3b
      Szabolcs Nagy committed
      Handle when after reduction |y| > pi/4+tiny. This happens in directed
      rounding modes because the fast round to int code does not give the
      nearest integer. In such cases the reduction may not be symmetric
      between x and -x so e.g. cos(x)==cos(-x) may not hold (but polynomial
      evaluation is not symmetric either with directed rounding so fixing
      that would require more changes with bigger performance impact).
      
      The fix only adds two predictable branches in nearest rounding mode;
      a simple microbenchmark does not show a relevant performance
      regression in that mode.
      
      The code could be improved, e.g. by reducing the medium size threshold
      such that a two step reduction is enough instead of three, and the
      single precision case can avoid the issue by doing the round to int
      differently, but this fix was kept minimal.
      b3797d3b
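The extra branches described in the __rem_pio2 message above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: it uses a single double-precision pi/2 constant, so it is only meaningful for small |x|, and `reduce_pio2_sketch` is a hypothetical name rather than the function in the commit. The add-and-subtract of `toint` rounds to an integer in the current rounding mode, which is why the result can be off by one in directed modes.

```c
#include <float.h>

/* Adding and subtracting 1.5*2^52 rounds a small double to an integer
 * using the current rounding mode. */
static const double toint   = 1.5 / DBL_EPSILON;
static const double invpio2 = 6.36619772367581382433e-01; /* 2/pi */
static const double pio2    = 1.57079632679489655800e+00; /* pi/2 */
static const double pio4    = 7.85398163397448278999e-01; /* pi/4 */

/* Hypothetical sketch: returns n and stores y such that x ~= n*pi/2 + y. */
int reduce_pio2_sketch(double x, double *y)
{
    double fn = x * invpio2 + toint - toint;
    double r = x - fn * pio2;

    /* Only taken when fn was not the nearest integer, which can happen
     * in directed rounding modes. */
    if (r < -pio4) {
        fn -= 1.0;
        r = x - fn * pio2;
    } else if (r > pio4) {
        fn += 1.0;
        r = x - fn * pio2;
    }
    *y = r;
    return (int)fn;
}
```

In round-to-nearest neither branch should be taken for ordinary inputs, which matches the claim that the branches are predictable.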
  3. 07 Feb, 2020 5 commits
  4. 28 Jan, 2020 1 commit
  5. 06 Nov, 2019 1 commit
  6. 14 Oct, 2019 2 commits
    • I
      mips: add single-instruction math functions · 1c9d2cba
      info@mobile-stream.com committed
      SQRT.fmt exists on MIPS II+ (float), MIPS III+ (double).
      
      ABS.fmt exists on MIPS I+ but only cores with ABS2008 flag in FCSR
      implement the required behaviour.
      1c9d2cba
    • S
      math: fix signed int left shift ub in sqrt · e8580630
      Szabolcs Nagy committed
      Both sqrt and sqrtf shifted the signed exponent as signed int to adjust
      the bit representation of the result. There are signed right shifts too
      in the code but those are implementation defined and are expected to
      compile to arithmetic shift on supported compilers and targets.
      e8580630
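As an illustration of the kind of change the sqrt message above describes, a left shift can be performed on the unsigned representation so that negative values do not trigger undefined behaviour. `shift_exponent_bits` is a hypothetical helper, not a function from musl.

```c
#include <stdint.h>

/* Left-shifting a negative signed int is undefined behaviour in C.
 * Converting to unsigned first makes the shift well defined (modulo 2^32).
 * The caller must pass n < 32. */
uint32_t shift_exponent_bits(int32_t e, unsigned n)
{
    return (uint32_t)e << n;
}
```

For example, `shift_exponent_bits(-1, 20)` yields `0xFFF00000`, the bit pattern one would expect from a two's complement shift.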
  7. 27 Sep, 2019 1 commit
    • S
      math: optimize lrint on 32bit targets · ca577951
      Szabolcs Nagy committed
      lrint in (LONG_MAX, 1/DBL_EPSILON) and in (-1/DBL_EPSILON, LONG_MIN)
      is not trivial: rounding to int may be inexact, but the conversion to
      int may overflow and then the inexact flag must not be raised. (the
      overflow threshold is rounding mode dependent).
      
      this matters on 32bit targets (without single instruction lrint or
      rint), so the common case (when there is no overflow) is optimized by
      inlining the lrint logic, otherwise the old code is kept as a fallback.
      
      on my laptop an i486 lrint call is asm:10ns, old c:30ns, new c:21ns
      on a smaller arm core: old c:71ns, new c:34ns
      on a bigger arm core: old c:27ns, new c:19ns
      ca577951
  8. 06 Aug, 2019 2 commits
    • R
      fix build regression in i386 asm for atan2, atan2f · 6818c31c
      Rich Felker committed
      commit f3ed8bfe inadvertently removed
      labels that were still needed.
      6818c31c
    • R
      fix x87 stack imbalance in corner cases of i386 math asm · f3ed8bfe
      Rich Felker committed
      commit 31c5fb80 introduced underflow
      code paths for the i386 math asm, along with checks on the fpu status
      word to skip the underflow-generation instructions if the underflow
      flag was already raised. unfortunately, at least one such path, in
      log1p, returned with 2 items on the x87 stack rather than just 1 item
      for the return value. this is a violation of the ABI's calling
      convention, and could cause subsequent floating point code to produce
      NANs due to x87 stack overflow. if floating point results are used in
      flow control, this can lead to runaway wrong code execution.
      
      rather than reviewing each "underflow already raised" code path for
      correctness, remove them all. they're likely slower than just
      performing the underflow code unconditionally, and significantly more
      complex.
      
      all of this code should be ripped out and replaced by C source files
      with inline asm. doing so would preclude this kind of error by having
      the compiler perform all x87 stack register allocation and stack
      manipulation, and would produce comparable or better code. however
      such a change is a much larger project.
      f3ed8bfe
  9. 15 Jun, 2019 1 commit
    • R
      add riscv64 architecture support · 0a48860c
      Rich Felker committed
      Author: Alex Suykov <alex.suykov@gmail.com>
      Author: Aric Belsito <lluixhi@gmail.com>
      Author: Drew DeVault <sir@cmpwn.com>
      Author: Michael Clark <mjc@sifive.com>
      Author: Michael Forney <mforney@mforney.org>
      Author: Stefan O'Rear <sorear2@gmail.com>
      
      This port has involved the work of many people over several years. I
      have tried to ensure that everyone with substantial contributions has
      been credited above; if any omissions are found they will be noted
      later in an update to the authors/contributors list in the COPYRIGHT
      file.
      
      The version committed here comes from the riscv/riscv-musl repo's
      commit 3fe7e2c75df78eef42dcdc352a55757729f451e2, with minor changes by
      me for issues found during final review:
      
      - a_ll/a_sc atomics are removed (according to the ISA spec, lr/sc
        are not safe to use in separate inline asm fragments)
      
      - a_cas[_p] is fixed to be a memory barrier
      
      - the call from the _start assembly into the C part of crt1/ldso is
        changed to allow for the possibility that the linker does not place
        them near each other.
      
      - DTP_OFFSET is defined correctly so that local-dynamic TLS works
      
      - reloc.h LDSO_ARCH logic is simplified and made explicit.
      
      - unused, non-functional crti/n asm files are removed.
      
      - an empty .sdata section is added to crt1 so that the
        __global_pointer reference is resolvable.
      
      - indentation style errors in some asm files are fixed.
      0a48860c
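The point about a_cas acting as a memory barrier can be illustrated with a portable stand-in. This sketch uses the GCC/Clang `__atomic_compare_exchange_n` builtin with sequentially consistent ordering; it is not the riscv64 inline asm from the port, and `a_cas_sketch` is a hypothetical name.

```c
/* Compare-and-swap that returns the value previously stored at *p.
 * Sequentially consistent ordering is requested for both the success
 * and failure cases, so the operation also orders surrounding accesses. */
int a_cas_sketch(volatile int *p, int expected, int desired)
{
    /* On failure the builtin writes the observed value into 'expected';
     * on success 'expected' already equals the old value. */
    __atomic_compare_exchange_n(p, &expected, desired, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}
```

A caller can tell whether the swap happened by comparing the return value with the expected value it passed in.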
  10. 18 Apr, 2019 10 commits
    • S
      math: new pow · e4dd6530
      Szabolcs Nagy committed
      from https://github.com/ARM-software/optimized-routines,
      commit 04884bd04eac4b251da4026900010ea7d8850edc
      
      The underflow exception is signaled if the result is in the subnormal
      range even if the result is exact.
      
      code size change: +3421 bytes.
      benchmark on x86_64 before, after, speedup:
      
      -Os:
         pow rthruput: 102.96 ns/call 33.38 ns/call 3.08x
          pow latency: 144.37 ns/call 54.75 ns/call 2.64x
      -O3:
         pow rthruput:  98.91 ns/call 32.79 ns/call 3.02x
          pow latency: 138.74 ns/call 53.78 ns/call 2.58x
      e4dd6530
    • S
      math: new exp and exp2 · e16f7b3c
      Szabolcs Nagy committed
      from https://github.com/ARM-software/optimized-routines,
      commit 04884bd04eac4b251da4026900010ea7d8850edc
      
      TOINT_INTRINSICS and EXP_USE_TOINT_NARROW cases are unused.
      
      The underflow exception is signaled if the result is in the subnormal
      range even if the result is exact (e.g. exp2(-1023.0)).
      
      code size change: -1672 bytes.
      benchmark on x86_64 before, after, speedup:
      
      -Os:
         exp rthruput:  12.73 ns/call  6.68 ns/call 1.91x
          exp latency:  45.78 ns/call 21.79 ns/call 2.1x
        exp2 rthruput:   6.35 ns/call  5.26 ns/call 1.21x
         exp2 latency:  26.00 ns/call 16.58 ns/call 1.57x
      -O3:
         exp rthruput:  12.75 ns/call  6.73 ns/call 1.89x
          exp latency:  45.91 ns/call 21.80 ns/call 2.11x
        exp2 rthruput:   6.47 ns/call  5.40 ns/call 1.2x
         exp2 latency:  26.03 ns/call 16.54 ns/call 1.57x
      e16f7b3c
    • S
      math: new log2 · 2a3210cf
      Szabolcs Nagy committed
      from https://github.com/ARM-software/optimized-routines,
      commit 04884bd04eac4b251da4026900010ea7d8850edc
      
      code size change: +2458 bytes (+1524 bytes with fma).
      benchmark on x86_64 before, after, speedup:
      
      -Os:
        log2 rthruput:  16.08 ns/call 10.49 ns/call 1.53x
         log2 latency:  44.54 ns/call 25.55 ns/call 1.74x
      -O3:
        log2 rthruput:  15.92 ns/call 10.11 ns/call 1.58x
         log2 latency:  44.66 ns/call 26.16 ns/call 1.71x
      2a3210cf
    • S
      math: new log · 236cd056
      Szabolcs Nagy committed
      from https://github.com/ARM-software/optimized-routines,
      commit 04884bd04eac4b251da4026900010ea7d8850edc
      
      Assume __FP_FAST_FMA implies __builtin_fma is inlined as a single
      instruction.
      
      code size change: +4588 bytes (+2540 bytes with fma).
      benchmark on x86_64 before, after, speedup:
      
      -Os:
         log rthruput:  12.61 ns/call  7.95 ns/call 1.59x
          log latency:  41.64 ns/call 23.38 ns/call 1.78x
      -O3:
         log rthruput:  12.51 ns/call  7.75 ns/call 1.61x
          log latency:  41.82 ns/call 23.55 ns/call 1.78x
      236cd056
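The __FP_FAST_FMA assumption in the log message above can be sketched as a compile-time switch. `mul_add_sketch` is a hypothetical helper, and the fallback shown here is a plain multiply-add with two roundings, not a true fused operation.

```c
/* If the compiler advertises a fast fused multiply-add, assume
 * __builtin_fma expands to a single instruction and use it;
 * otherwise fall back to a separate multiply and add. */
double mul_add_sketch(double a, double b, double c)
{
#ifdef __FP_FAST_FMA
    return __builtin_fma(a, b, c);
#else
    return a * b + c; /* rounds twice, unlike a real fma */
#endif
}
```

The two paths can differ in the last bit for some inputs, which is why code relying on this pattern typically keeps separate constants or evaluation schemes for each case.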
    • S
      math: new powf · d28cd0ad
      Szabolcs Nagy committed
      from https://github.com/ARM-software/optimized-routines,
      commit 04884bd04eac4b251da4026900010ea7d8850edc
      
      POWF_SCALE != 1.0 case only matters if TOINT_INTRINSICS is set, which
      is currently not supported for any target.
      
      SNaN is not supported; it would require an issignalingf
      implementation.
      
      code size change: -816 bytes.
      benchmark on x86_64 before, after, speedup:
      
      -Os:
        powf rthruput:  95.14 ns/call 20.04 ns/call 4.75x
         powf latency: 137.00 ns/call 34.98 ns/call 3.92x
      -O3:
        powf rthruput:  92.48 ns/call 13.67 ns/call 6.77x
         powf latency: 131.11 ns/call 35.15 ns/call 3.73x
      d28cd0ad
    • S
      math: new exp2f and expf · 3f94c648
      Szabolcs Nagy committed
      from https://github.com/ARM-software/optimized-routines,
      commit 04884bd04eac4b251da4026900010ea7d8850edc
      
      In expf TOINT_INTRINSICS is kept but unused; it would require support
      for __builtin_round and __builtin_lround as single instructions.
      
      code size change: +94 bytes.
      benchmark on x86_64 before, after, speedup:
      
      -Os:
        expf rthruput:   9.19 ns/call  8.11 ns/call 1.13x
         expf latency:  34.19 ns/call 18.77 ns/call 1.82x
       exp2f rthruput:   5.59 ns/call  6.52 ns/call 0.86x
        exp2f latency:  17.93 ns/call 16.70 ns/call 1.07x
      -O3:
        expf rthruput:   9.12 ns/call  4.92 ns/call 1.85x
         expf latency:  34.44 ns/call 18.99 ns/call 1.81x
       exp2f rthruput:   5.58 ns/call  4.49 ns/call 1.24x
        exp2f latency:  17.95 ns/call 16.94 ns/call 1.06x
      3f94c648
    • S
      math: new log2f · 098868b3
      Szabolcs Nagy committed
      from https://github.com/ARM-software/optimized-routines,
      commit 04884bd04eac4b251da4026900010ea7d8850edc
      
      code size change: +177 bytes.
      benchmark on x86_64 before, after, speedup:
      
      -Os:
       log2f rthruput:  11.38 ns/call  5.99 ns/call 1.9x
        log2f latency:  35.01 ns/call 22.57 ns/call 1.55x
      -O3:
       log2f rthruput:  10.82 ns/call  5.58 ns/call 1.94x
        log2f latency:  35.13 ns/call 21.04 ns/call 1.67x
      098868b3
    • S
      math: new logf · db505b79
      Szabolcs Nagy committed
      from https://github.com/ARM-software/optimized-routines,
      commit 04884bd04eac4b251da4026900010ea7d8850edc,
      with minor changes to better fit into musl.
      
      code size change: +289 bytes.
      benchmark on x86_64 before, after, speedup:
      
      -Os:
        logf rthruput:   8.40 ns/call  6.14 ns/call 1.37x
         logf latency:  31.79 ns/call 24.33 ns/call 1.31x
      -O3:
        logf rthruput:   8.43 ns/call  5.58 ns/call 1.51x
         logf latency:  32.04 ns/call 20.88 ns/call 1.53x
      db505b79
    • S
      math: add double precision error handling functions · 4f8acf95
      Szabolcs Nagy committed
      4f8acf95
    • S
      math: add single precision error handling functions · 9ef6ca42
      Szabolcs Nagy committed
      These are supposed to be used in tail call positions when handling
      special cases in new code. (fp exceptions may be raised "naturally"
      by the common code path if special casing is more effort.)
      
      This implements the error handling apis used in
      https://github.com/ARM-software/optimized-routines
      without errno setting.
      9ef6ca42
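A rough sketch of what a single precision error handling helper of the kind described above might look like. The names and the constants `0x1p97f` and `0x1p-95f` are assumptions chosen for illustration; the idea is that the overflow or underflow exception is raised by an actual floating-point operation rather than by setting flags or errno by hand.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: multiplies y by +/-y so that the hardware raises
 * overflow or underflow naturally. The volatile objects keep the compiler
 * from folding the multiplication away at compile time. */
float xflowf_sketch(uint32_t sign, float y)
{
    volatile float v = sign ? -y : y;
    volatile float r = v * y;
    return r;
}

/* Result overflows to +/-infinity in round-to-nearest. */
float oflowf_sketch(uint32_t sign)
{
    return xflowf_sketch(sign, 0x1p97f);
}

/* Result underflows to +/-0 in round-to-nearest. */
float uflowf_sketch(uint32_t sign)
{
    return xflowf_sketch(sign, 0x1p-95f);
}
```

In a special-case branch such a helper would sit in tail call position, for example `if (x > limit) return oflowf_sketch(0);`, where `limit` stands for whatever threshold the calling function uses.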
  11. 04 Apr, 2019 1 commit
    • D
      fix unintended global symbols in atanl.c · 81868803
      Dan Gohman committed
      Mark atanhi, atanlo, and aT in atanl.c as static, as they're not
      intended to be part of the public API.
      
      These are already static in the LDBL_MANT_DIG == 64 code, so this
      patch is just making the LDBL_MANT_DIG == 113 code do the same thing.
      81868803
  12. 16 Oct, 2018 4 commits
    • S
      x86_64: add single instruction fma · e9016138
      Szabolcs Nagy committed
      fma is only available on recent x86_64 cpus and it is much faster than
      a software fma, so this should be done with a runtime check. However,
      that requires more changes; this patch just adds the code so it can be
      tested when musl is compiled with -mfma or -mfma4.
      e9016138
    • S
      arm: add single instruction fma · 7396ef0a
      Szabolcs Nagy committed
      vfma is available in the vfpv4 fpu and above, the ACLE standard feature
      test for double precision hardware fma support is
        __ARM_FEATURE_FMA && __ARM_FP&8
      we need further checks to work around clang bugs (fixed in clang >=7.0)
        && !__SOFTFP__
      because __ARM_FP is defined even with -mfloat-abi=soft
        && !BROKEN_VFP_ASM
      to disable the single precision code when inline asm handling is broken.
      
      For runtime selection the HWCAP_ARM_VFPv4 hwcap flag can be used, but
      that requires further work.
      7396ef0a
    • S
      powerpc: add single instruction fabs, fabsf, fma, fmaf, sqrt, sqrtf · 7c5f3bb9
      Szabolcs Nagy committed
      These are only available on hard float targets, and sqrt is not
      available in the base ISA, so a further check is used.
      7c5f3bb9
    • S
      s390x: add single instruction fma and fmaf · 1da534ad
      Szabolcs Nagy committed
      These are available in the s390x baseline isa -march=z900.
      1da534ad
  13. 13 Sep, 2018 4 commits
    • R
      reduce spurious inclusion of libc.h · 5ce37379
      Rich Felker committed
      libc.h was intended to be a header for access to global libc state and
      related interfaces, but ended up included all over the place because
      it was the way to get the weak_alias macro. most of the inclusions
      removed here are places where weak_alias was needed. a few were
      recently introduced for hidden. some go all the way back to when
      libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented)
      cancellation points had to include it.
      
      remaining spurious users are mostly callers of the LOCK/UNLOCK macros
      and files that use the LFS64 macro to define the awful *64 aliases.
      
      in a few places, new inclusion of libc.h is added because several
      internal headers no longer implicitly include libc.h.
      
      declarations for __lockfile and __unlockfile are moved from libc.h to
      stdio_impl.h so that the latter does not need libc.h. putting them in
      libc.h made no sense at all, since the macros in stdio_impl.h are
      needed to use them correctly anyway.
      5ce37379
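For readers unfamiliar with the weak_alias macro mentioned above, the following sketch shows a GCC-style definition of such a macro and one use. The exact definition is an assumption modelled on common practice for ELF targets, and `impl_sketch` and `public_sketch` are hypothetical names.

```c
/* Declares 'new' as a weak symbol that aliases the definition of 'old'.
 * Requires a compiler and object format that support the alias
 * attribute (for example GCC or Clang on ELF). */
#define weak_alias(old, new) \
    extern __typeof(old) new __attribute__((__weak__, __alias__(#old)))

/* Internal implementation. */
int impl_sketch(int x)
{
    return x + 1;
}

/* Public name that resolves to the same code unless another
 * definition overrides the weak symbol at link time. */
weak_alias(impl_sketch, public_sketch);
```

Calling `public_sketch(1)` then runs `impl_sketch` and returns 2.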
    • R
      7e399fab
    • R
      apply hidden visibility to internal math functions · 46e3895b
      Rich Felker committed
      this makes significant differences to codegen on archs with an
      expensive PLT-calling ABI; on i386 and gcc 7.3 for example, the sin
      and sinf functions no longer touch call-saved registers or the stack
      except for pushing outgoing arguments. performance is likely improved
      too, but no measurements were taken.
      46e3895b
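A small sketch of how hidden visibility is typically applied to an internal helper, assuming a GCC-compatible compiler. The `hidden` macro spelling and the function names here are illustrative assumptions, and the polynomial is a placeholder rather than a real kernel.

```c
/* Hidden symbols are not exported from the shared library, so calls to
 * them can bind locally instead of going through the PLT. */
#define hidden __attribute__((__visibility__("hidden")))

/* Internal helper: declared hidden, visible only inside the library. */
hidden double kernel_sketch(double x);

double kernel_sketch(double x)
{
    return x - x * x * x / 6.0; /* placeholder polynomial */
}

/* Public entry point that calls the internal helper directly. */
double public_entry_sketch(double x)
{
    return kernel_sketch(x);
}
```

On targets where PLT calls need a dedicated register or extra stack handling, binding the internal call locally is what lets the compiler drop that setup.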
    • R
      move lgamma-related internal declarations to libm.h · 59d88940
      Rich Felker committed
      59d88940