c - Handling zeroes in _mm256_rsqrt_ps()

Question

Welcome To Ask or Share your Answers For Others

c - Handling zeroes in _mm256_rsqrt_ps()

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c - Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:

_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
              eightFloats);

Is the way to go for that extra bit of performance and avoiding a pipeline stall.

Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?

_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
                                            _mm256_set1_ps(0.1))),
              eightFloats);

My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:

float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
    int thisSqrt = (int) floor(sqrt(vals[i]));
    sum += min(thisSqrt, 0x3F);
}

All the values of vals should be integer values. (Why everything isn't just int is a different question...)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:33:54+0000

tl;dr: See the end for code that compiles and should work.

To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x)). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)

It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.

The thing you're missing:

Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.

Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).

Also:

Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.

It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2^-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).

63*1.5*2^(-12)         ==  0.023071...
approx_sqrt(63*63-1)   == 62.99206... +/- 0.023068..

Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:

define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140

Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?

You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63) to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2^-12 requires more instructions. I don't see any tricks a NR iteration.

If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.

; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087   # (x+0.5)/sqrt(x+0.5)
 ~0.00732879495710064718  ~0.00731963968187500662   # relative error

Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.

Anyway, I think you need one Newton-Raphson iteration.

The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).

FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.

Compiler output: a decent starting point

I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).

sum_fp doesn't auto-vectorize, even with the full -ffast-math.

However clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX², otherwise converting back to integer will give the "indefinite" result of INT_MIN. (sign bit set, all other bits cleared). This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.

See the asm output on godbolt from:

// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
  int sum = 0;
  for (int i = 0; i < MAX; i++) {
    int thisSqrt = (int) sqrtf(vals[i]);
    sum += std::min(thisSqrt, 0x3F);
  }
  return sum;
}

To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.

putting it together:

I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).

As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).

I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.

Horizontal sum from this answer.

Complete but untested version:

#include <immintrin.h>
#define MAX 4096

// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
    // clang / gcc usually use -3.0 and -0.5.  We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
    __m256 three = _mm256_set1_ps(3.0f);
    //__m256 half = _mm256_set1_ps(0.5f);  // we omit the *0.5 step

    __m256 nr  = _mm256_rsqrt_ps( x );  // initial approximation for Newton-Raphson
    //           1/sqrt(x) ~=    nr  * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
    // sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
    // 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
    __m256 xnr = _mm256_mul_ps( x, nr );

    __m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three );  // -(xnr*nr) + 3
    return _mm256_mul_ps( xnr, three_minus_muls );
}

// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
    __m256 offset = _mm256_set1_ps(0.5f);    // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
    __m256 xoff = _mm256_add_ps(x, offset);  // avoids 0*Inf = NaN, and rounding error before truncation
    __m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
    __m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
    return _mm256_srli_epi32(i2sqrtx, 1);     // divide by 2 with truncation
    // alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}

__m256i isqrt256_ps_simple_exact(__m256 x) {
    __m256 sqrt_x = _mm256_sqrt_ps(x);
    __m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
    return isqrtx;
}

int hsum_epi32_avx(__m256i x256){
    __m128i xhi = _mm256_extracti128_si256(x256, 1);
    __m128i xlo = _mm256_castsi256_si128(x256);
    __m128i x  = _mm_add_epi32(xlo, xhi);
    __m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
    hl  = _mm_add_epi32(hl, x);
    x   = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
    hl  = _mm_add_epi32(hl, x);
    return _mm_cvtsi128_si32(hl);
}

int sum_int_avx(float vals[]){
  __m256i sum = _mm256_setzero_si256();
  __m256i upperlimit = _mm256_set1_epi32(0x3F);

  for (int i = 0; i < MAX; i+=8) {
    __m256 v = _mm256_loadu_ps(vals+i);
    __m256i visqrt = isqrt256_ps(v);
    // assert visqrt == isqrt256_ps_simple_exact(v) or something
    visqrt = _mm256_min_epi32(visqrt, upperlimit);
    sum = _mm256_add_epi32(sum, visqrt);
  }
  return hsum_epi32_avx(sum);
}

Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code that gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.

Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated <cod

Categories

c - Handling zeroes in _mm256_rsqrt_ps()

c - Handling zeroes in _mm256_rsqrt_ps()

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

The thing you're missing:

Anyway, I think you need one Newton-Raphson iteration.

Compiler output: a decent starting point

putting it together:

Complete but untested version:

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags