Commits · 6bbdbfdcdeac216c4a13edd04dee1f6d87bd33c8 · OpenHarmony / Third Party Musl

25 Mar, 2020 6 commits
- math: move x86_64 (l)lrint(f) functions to C with inline asm · 6bbdbfdc
  Committed by Alexander Monakov on Jan 10, 2020
  
  6bbdbfdc
- math: move i386 sqrt to C with inline asm · acfe6d03
  Committed by Alexander Monakov on Jan 07, 2020
  
  acfe6d03
- math: move i386 sqrtf to C with inline asm · 29adaeb2
  Committed by Alexander Monakov on Jan 06, 2020
  
  29adaeb2
- math: move trivial x86-family sqrt functions to C with inline asm · 41b290ba
  Committed by Alexander Monakov on Jan 06, 2020
  
  41b290ba
- math: move x87-family fabs functions to C with inline asm · c24a9923
  Committed by Alexander Monakov on Jan 06, 2020
  
  c24a9923
- math: move x86_64 fabs, fabsf to C with inline asm · 87026f68
  Committed by Alexander Monakov on Jan 05, 2020
  
  87026f68
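The commits in this group replace standalone assembly files with C functions that wrap a single instruction in inline asm. A minimal sketch of that pattern for `lrint` on x86_64 follows; the name `lrint_sketch` and the portable fallback branch are assumptions made for illustration and are not taken from the musl sources.

```c
#include <assert.h>

/* Sketch only: round a double to long using the current rounding mode. */
static long lrint_sketch(double x)
{
#if defined(__x86_64__) && !defined(__ILP32__)
	long r;
	/* cvtsd2si converts using the rounding mode held in MXCSR.
	 * The 64-bit register operand selects the 64-bit form. */
	__asm__ ("cvtsd2si %1, %0" : "=r"(r) : "x"(x));
	return r;
#else
	/* Illustrative stand-in for other targets: rounds halfway cases
	 * away from zero, so it is NOT equivalent to lrint on ties. */
	return (long)(x < 0 ? x - 0.5 : x + 0.5);
#endif
}
```

Because the operands are described by constraints (`"=r"`, `"x"`), the compiler chooses the registers and can inline the function at call sites. For example, `assert(lrint_sketch(2.4) == 2)` holds on either branch.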
22 Feb, 2020 2 commits

math: fix sinh overflows in non-nearest rounding · d2055814

Committed by Szabolcs Nagy on Jan 20, 2020

The final rounding operation should be done with the correct sign,
otherwise huge results may incorrectly get rounded to or away from
infinity in upward or downward rounding modes.

This affected sinh and sinhf which set the sign on the result after
a potentially overflowing mul. There may be other non-nearest rounding
issues, but this was a known long standing issue with large ulp error
(depending on how ulp is defined near infinity).

The fix should have no effect on sinh and sinhf performance but may
have a tiny effect on cosh and coshf.
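The ordering issue described above can be sketched as follows. Both helper names are hypothetical, and the code is a simplified illustration rather than the actual sinh change: it only contrasts applying the sign after an overflowing multiplication with applying it before.

```c
#include <assert.h>
#include <math.h>   /* INFINITY macro only */

/* Old ordering: the overflow happens on a positive value and the sign
 * is applied afterwards. With downward rounding the positive overflow
 * yields DBL_MAX, so a negative result becomes -DBL_MAX, not -inf. */
static double scale_then_sign(double m, double scale, double sign)
{
	volatile double t = m * scale;   /* may overflow as a positive number */
	return sign * t;
}

/* Fixed ordering: the sign is already in place when the overflowing
 * multiplication is rounded, so directed rounding sees the true sign. */
static double sign_then_scale(double m, double scale, double sign)
{
	volatile double t = sign * m;    /* exact for sign = +1 or -1 */
	return t * scale;
}
```

In round-to-nearest both functions agree, for example both return `-INFINITY` for `m = 2.0`, `scale = 0x1p1023`, `sign = -1.0`. They differ only under upward or downward rounding, which is the case the commit addresses.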

d2055814

math: fix __rem_pio2 in non-nearest rounding modes · b3797d3b

Committed by Szabolcs Nagy on Jan 18, 2020

Handle the case where |y| > pi/4+tiny after reduction. This happens in directed
rounding modes because the fast round to int code does not give the
nearest integer. In such cases the reduction may not be symmetric
between x and -x so e.g. cos(x)==cos(-x) may not hold (but polynomial
evaluation is not symmetric either with directed rounding so fixing
that would require more changes with bigger performance impact).

The fix only adds two predictable branches in nearest rounding mode;
a simple microbenchmark does not show a relevant performance regression
in that mode.

The code could be improved: e.g. reducing the medium size threshold
such that two step reduction is enough instead of three, and the
single precision case can avoid the issue by doing the round to int
differently, but this fix was kept minimal.
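A minimal sketch of the idea, assuming plain `double` evaluation and a single-constant pi/2 (so it is only meaningful for small |x|, unlike the real multi-step reduction). The function name `reduce_sketch` is hypothetical; the point is the fast round-to-int trick followed by the two correcting branches.

```c
#include <assert.h>
#include <float.h>

static const double toint   = 1.5 / DBL_EPSILON;       /* 1.5 * 2^52 */
static const double pio2    = 1.57079632679489661923;  /* pi/2 */
static const double pio4    = 0.78539816339744830962;  /* pi/4 */
static const double invpio2 = 0.63661977236758134308;  /* 2/pi */

/* Reduce x to y with x ~= n*(pi/2) + y and |y| <= pi/4 (approximately). */
static int reduce_sketch(double x, double *y)
{
	/* Adding and subtracting toint rounds to an integer in the
	 * current rounding mode, which is not always the nearest one. */
	double fn = x * invpio2 + toint - toint;
	int n = (int)fn;
	double r = x - fn * pio2;

	/* Only matters with directed rounding: step to the neighbouring
	 * multiple of pi/2 if the remainder fell outside [-pi/4, pi/4]. */
	if (r < -pio4) {
		n--; fn--;
		r = x - fn * pio2;
	} else if (r > pio4) {
		n++; fn++;
		r = x - fn * pio2;
	}
	*y = r;
	return n;
}
```

In round-to-nearest the two branches are normally not taken, which matches the commit's note that they are cheap and predictable in that mode.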

b3797d3b

07 Feb, 2020 5 commits

remove i386 asm for single and double precision exp-family functions · a662220d

Committed by Rich Felker on Feb 06, 2020

these did not truncate excess precision in the return value. fixing
them looks like considerable work, and the current C code seems to
outperform them significantly anyway.

long double functions are left in place because they are not subject
to excess precision issues and probably better than the C code.

a662220d

rename i386 exp.s to exp_ld.s · 2f0c31c0

Committed by Rich Felker on Feb 06, 2020

this commit is for the sake of reviewable history.

2f0c31c0
fix excess precision in return value of i386 log-family functions · ab9e2090

Committed by Rich Felker on Feb 06, 2020

ab9e2090
fix excess precision in return value of i386 acos[f] and asin[f] · 141c8d4c

Committed by Rich Felker on Feb 06, 2020

analogous to commit 1c9afd69 for atan[2][f].

141c8d4c

fix excess precision in return value of i386 atan[2][f] · 1c9afd69

Committed by Rich Felker on Feb 06, 2020

for functions implemented in C, this is a requirement of C11 (F.6);
strictly speaking that text does not apply to standard library
functions, but it seems to be intended to apply to them, and C2x is
expected to make it a requirement.

failure to drop excess precision is particularly bad for inverse trig
functions, where a value with excess precision can be outside the
range of the function (entire range, or range for a particular
subdomain), breaking reasonable invariants a caller may expect.
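As a rough C-level analogue of what "dropping excess precision" means, the sketch below models the wide x87 register value with `long double` and rounds it to `double` before returning. The helper name is hypothetical, and the i386 fix itself was made in assembly, so this only illustrates the effect, not the committed code.

```c
#include <assert.h>

/* Round a value held in a wider format to double before returning it.
 * The cast performs the rounding; the volatile store keeps the compiler
 * from carrying the wider value through in a register. */
static double drop_excess_precision(long double wide)
{
	volatile double narrow = (double)wide;
	return narrow;
}
```

For instance, `1.0L + 0x1p-60L` is distinguishable from 1 in the 80-bit x87 format, but after the conversion the caller sees exactly `1.0`, a value that is representable as a `double`.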

1c9afd69

28 Jan, 2020 1 commit
- math/x32: correct lrintl.s for 32-bit long · ff5b8ad3
  Committed by Alexander Monakov on Jan 18, 2020
  
  ff5b8ad3
06 Nov, 2019 1 commit
- ppc: add configure check for older compilers erroring on 'd' constraint · 66d1e312
  Committed by rofl0r on Nov 05, 2019
  
  66d1e312
14 Oct, 2019 2 commits

mips: add single-instruction math functions · 1c9d2cba

Committed by info@mobile-stream.com on Sep 11, 2019

SQRT.fmt exists on MIPS II+ (float), MIPS III+ (double).

ABS.fmt exists on MIPS I+ but only cores with ABS2008 flag in FCSR
implement the required behaviour.

1c9d2cba

math: fix signed int left shift ub in sqrt · e8580630

Committed by Szabolcs Nagy on Oct 13, 2019

Both sqrt and sqrtf shifted the signed exponent as signed int to adjust
the bit representation of the result. There are signed right shifts too
in the code but those are implementation defined and are expected to
compile to arithmetic shift on supported compilers and targets.
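A small sketch of the kind of change described, assuming the exponent adjustment is added to the high word of the result. The function name and parameter layout are illustrative, not copied from sqrt.c.

```c
#include <assert.h>
#include <stdint.h>

/* Add a signed exponent adjustment m to the high 32 bits of a double's
 * bit representation (the exponent field starts at bit 20 there). */
static uint32_t add_exponent(uint32_t hi, int32_t m)
{
	/* (m << 20) is undefined behaviour when m is negative.
	 * Converting to unsigned first makes the shift well defined,
	 * and the addition then wraps modulo 2^32 as intended. */
	return hi + ((uint32_t)m << 20);
}
```

For example, `add_exponent(0x3ff00000, -1)` yields `0x3fe00000`, turning the high word of 1.0 into that of 0.5 without shifting a negative signed value.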

e8580630

27 Sep, 2019 1 commit

math: optimize lrint on 32bit targets · ca577951

Committed by Szabolcs Nagy on Sep 16, 2019

lrint in (LONG_MAX, 1/DBL_EPSILON) and in (-1/DBL_EPSILON, LONG_MIN)
is not trivial: rounding to int may be inexact, but the conversion to
int may overflow and then the inexact flag must not be raised. (the
overflow threshold is rounding mode dependent).

this matters on 32bit targets (without single instruction lrint or
rint), so the common case (when there is no overflow) is optimized by
inlining the lrint logic, otherwise the old code is kept as a fallback.
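The fast path described above can be sketched roughly as follows for a 32-bit result. This assumes IEEE double with plain `double` evaluation; the names are hypothetical, and the slow fallback (rounding plus inexact-flag handling near the overflow threshold) is deliberately left to the caller.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

static uint64_t asuint64(double x) { uint64_t u; memcpy(&u, &x, sizeof u); return u; }
static double asdouble(uint64_t u) { double x; memcpy(&x, &u, sizeof x); return x; }

/* Returns 1 and stores the rounded value when overflow is impossible;
 * returns 0 when the caller must take the careful slow path. */
static int lrint32_fast(double x, int32_t *out)
{
	uint32_t abstop = asuint64(x) >> 32 & 0x7fffffff;
	uint64_t sign = asuint64(x) & (1ULL << 63);

	if (abstop < 0x41dfffff) {
		/* |x| < 0x7ffffc00: the result always fits in 32 bits.
		 * Adding and subtracting 2^52 with the sign of x rounds
		 * to an integer in the current rounding mode. */
		double toint = asdouble(asuint64(1 / DBL_EPSILON) | sign);
		double y = x + toint - toint;
		*out = (int32_t)y;
		return 1;
	}
	return 0;
}
```

In round-to-nearest, `lrint32_fast(2.5, &v)` stores 2 (ties go to even), while an input such as `1e10` is rejected so that the slow path can deal with overflow and the inexact flag.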

on my laptop an i486 lrint call is asm:10ns, old c:30ns, new c:21ns
on a smaller arm core: old c:71ns, new c:34ns
on a bigger arm core: old c:27ns, new c:19ns

ca577951

06 Aug, 2019 2 commits

fix build regression in i386 asm for atan2, atan2f · 6818c31c

Committed by Rich Felker on Aug 05, 2019

commit f3ed8bfe inadvertently removed labels that were still needed.

6818c31c

fix x87 stack imbalance in corner cases of i386 math asm · f3ed8bfe

Committed by Rich Felker on Aug 05, 2019

commit 31c5fb80 introduced underflow
code paths for the i386 math asm, along with checks on the fpu status
word to skip the underflow-generation instructions if the underflow
flag was already raised. unfortunately, at least one such path, in
log1p, returned with 2 items on the x87 stack rather than just 1 item
for the return value. this is a violation of the ABI's calling
convention, and could cause subsequent floating point code to produce
NANs due to x87 stack overflow. if floating point results are used in
flow control, this can lead to runaway wrong code execution.

rather than reviewing each "underflow already raised" code path for
correctness, remove them all. they're likely slower than just
performing the underflow code unconditionally, and significantly more
complex.

all of this code should be ripped out and replaced by C source files
with inline asm. doing so would preclude this kind of error by having
the compiler perform all x87 stack register allocation and stack
manipulation, and would produce comparable or better code. however
such a change is a much larger project.
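The approach suggested in the last paragraph can be sketched with GCC-style inline asm, where the `"t"` constraint names the top of the x87 stack. The function name and the non-x86 fallback are assumptions for illustration only.

```c
#include <assert.h>

/* Sketch: absolute value of a long double via the x87 fabs instruction. */
static long double fabsl_sketch(long double x)
{
#if defined(__i386__) || defined(__x86_64__)
	/* "+t" asks the compiler to place x in st(0) on entry and to read
	 * the result from st(0) on exit. All pushes and pops around the
	 * instruction are generated by the compiler, so the hand-written
	 * part cannot leave the x87 stack unbalanced. */
	__asm__ ("fabs" : "+t"(x));
	return x;
#else
	/* Illustrative fallback for other targets (ignores -0.0 and NaN sign). */
	return x < 0 ? -x : x;
#endif
}
```

The asm body is a single instruction with no manual stack manipulation, which is the property the commit message argues would preclude the log1p-style imbalance.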

f3ed8bfe

15 Jun, 2019 1 commit

add riscv64 architecture support · 0a48860c

Committed by Rich Felker on May 24, 2019

Author: Alex Suykov <alex.suykov@gmail.com>
Author: Aric Belsito <lluixhi@gmail.com>
Author: Drew DeVault <sir@cmpwn.com>
Author: Michael Clark <mjc@sifive.com>
Author: Michael Forney <mforney@mforney.org>
Author: Stefan O'Rear <sorear2@gmail.com>

This port has involved the work of many people over several years. I
have tried to ensure that everyone with substantial contributions has
been credited above; if any omissions are found they will be noted
later in an update to the authors/contributors list in the COPYRIGHT
file.

The version committed here comes from the riscv/riscv-musl repo's
commit 3fe7e2c75df78eef42dcdc352a55757729f451e2, with minor changes by
me for issues found during final review:

- a_ll/a_sc atomics are removed (according to the ISA spec, lr/sc
  are not safe to use in separate inline asm fragments)

- a_cas[_p] is fixed to be a memory barrier

- the call from the _start assembly into the C part of crt1/ldso is
  changed to allow for the possibility that the linker does not place
  them near each other.

- DTP_OFFSET is defined correctly so that local-dynamic TLS works

- reloc.h LDSO_ARCH logic is simplified and made explicit.

- unused, non-functional crti/n asm files are removed.

- an empty .sdata section is added to crt1 so that the
  __global_pointer reference is resolvable.

- indentation style errors in some asm files are fixed.
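To illustrate the a_cas item in the list above: a compare-and-swap that returns the previously observed value and requests sequentially consistent ordering can be written with C11 atomics as below. This is only a rough analogue under that assumption; musl's riscv64 a_cas is written in inline assembly, and `cas_sketch` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>

/* Returns the value observed in *p before the operation. The swap to
 * `desired` happens only if that value equals `expected`. */
static int cas_sketch(_Atomic int *p, int expected, int desired)
{
	/* The generic function uses memory_order_seq_cst. On failure it
	 * writes the observed value back into `expected`; on success
	 * `expected` already equals the old value. */
	atomic_compare_exchange_strong(p, &expected, desired);
	return expected;
}
```

Keeping the load-reserved/store-conditional pair inside one primitive like this, rather than exposing the two halves as separate operations, also reflects the first item in the list about a_ll/a_sc being removed.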

0a48860c

18 Apr, 2019 10 commits

math: new pow · e4dd6530

Committed by Szabolcs Nagy on Dec 01, 2018

from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc

The underflow exception is signaled if the result is in the subnormal
range even if the result is exact.

code size change: +3421 bytes.
benchmark on x86_64 before, after, speedup:

-Os:
   pow rthruput: 102.96 ns/call 33.38 ns/call 3.08x
    pow latency: 144.37 ns/call 54.75 ns/call 2.64x
-O3:
   pow rthruput:  98.91 ns/call 32.79 ns/call 3.02x
    pow latency: 138.74 ns/call 53.78 ns/call 2.58x

e4dd6530

math: new exp and exp2 · e16f7b3c

Committed by Szabolcs Nagy on Nov 30, 2018

from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc

TOINT_INTRINSICS and EXP_USE_TOINT_NARROW cases are unused.

The underflow exception is signaled if the result is in the subnormal
range even if the result is exact (e.g. exp2(-1023.0)).

code size change: -1672 bytes.
benchmark on x86_64 before, after, speedup:

-Os:
   exp rthruput:  12.73 ns/call  6.68 ns/call 1.91x
    exp latency:  45.78 ns/call 21.79 ns/call 2.1x
  exp2 rthruput:   6.35 ns/call  5.26 ns/call 1.21x
   exp2 latency:  26.00 ns/call 16.58 ns/call 1.57x
-O3:
   exp rthruput:  12.75 ns/call  6.73 ns/call 1.89x
    exp latency:  45.91 ns/call 21.80 ns/call 2.11x
  exp2 rthruput:   6.47 ns/call  5.40 ns/call 1.2x
   exp2 latency:  26.03 ns/call 16.54 ns/call 1.57x

e16f7b3c

math: new log2 · 2a3210cf

Committed by Szabolcs Nagy on Dec 01, 2018

from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc

code size change: +2458 bytes (+1524 bytes with fma).
benchmark on x86_64 before, after, speedup:

-Os:
  log2 rthruput:  16.08 ns/call 10.49 ns/call 1.53x
   log2 latency:  44.54 ns/call 25.55 ns/call 1.74x
-O3:
  log2 rthruput:  15.92 ns/call 10.11 ns/call 1.58x
   log2 latency:  44.66 ns/call 26.16 ns/call 1.71x

2a3210cf

math: new log · 236cd056

Committed by Szabolcs Nagy on Dec 01, 2018

from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc

Assume __FP_FAST_FMA implies __builtin_fma is inlined as a single
instruction.
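A minimal sketch of how such an assumption is typically expressed, with a hypothetical `HAVE_FAST_FMA` macro and `mul_add` helper that are not taken from the musl sources:

```c
#include <assert.h>

/* __FP_FAST_FMA is predefined by GCC-compatible compilers when the
 * target has a fast fused multiply-add for double. */
#ifdef __FP_FAST_FMA
#define HAVE_FAST_FMA 1
#else
#define HAVE_FAST_FMA 0
#endif

/* Compute a*b + c, fused when the target is assumed to do it in one
 * instruction, otherwise as a separate multiply and add. */
static double mul_add(double a, double b, double c)
{
#if HAVE_FAST_FMA
	return __builtin_fma(a, b, c);
#else
	return a * b + c;
#endif
}
```

The two branches can differ in the last bit because the fused form rounds once, which is consistent with the commit reporting a separate code size figure "with fma".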

code size change: +4588 bytes (+2540 bytes with fma).
benchmark on x86_64 before, after, speedup:

-Os:
   log rthruput:  12.61 ns/call  7.95 ns/call 1.59x
    log latency:  41.64 ns/call 23.38 ns/call 1.78x
-O3:
   log rthruput:  12.51 ns/call  7.75 ns/call 1.61x
    log latency:  41.82 ns/call 23.55 ns/call 1.78x

236cd056

math: new powf · d28cd0ad