floating point - Emulating FP64 with 2 FP32 on a GPU

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

If one were to emulate double-precision floating point with two single-precision floats, what would the performance be like, and can it be done well?

Currently NVIDIA charges quite a premium for double-precision-enabled Tesla cards, which deliver double-precision throughput at one third of their single-precision throughput (notable exceptions: Titan and Titan Black).

If one were to use a GeForce GPU with gimped double precision and emulate double precision using two single-precision floats, what would the performance be like?

1 Answer

深蓝 · Answer 1 · 2021-10-17T02:51:21+0000

You can get a rough estimate of the performance by counting the number of float operations required to implement each double-float operation. You would want to inspect binary code with cuobjdump --dump-sass to get an accurate count. I am showing a double-float multiplication below that takes full advantage of FMA (fused multiply-add) support on the GPU. For double-float addition code, I would point you to a paper by Andrew Thall as I do not have the time to code this up right now. From previous analysis I believe the addition code given in the paper is correct, and that it avoids common pitfalls in faster but less accurate implementations (which lose accuracy when the magnitude of the operands is within a factor of two).

If you are a registered CUDA developer you can download double-double code from NVIDIA's developer website (log in at https://developer.nvidia.com) which is under BSD license, and rework it relatively quickly into double-float code. NVIDIA's double-double code supports the operations addition, subtraction, division, square root, and reciprocal square root.

As you can see, the multiplication below requires 8 float instructions; unary negation is absorbed into FMA. The addition requires around 20 float instructions. However, the instruction sequences for double-float operations also require temporary variables, which increases register pressure and can decrease occupancy. A reasonably conservative estimate may therefore be that double-float arithmetic performs at 1/20 the throughput of native float arithmetic. You can easily measure this yourself, in the context relevant to you, i.e. your use case(s).

typedef float2 dblfloat;  // .y = head, .x = tail

__host__ __device__ __forceinline__ 
dblfloat mul_dblfloat (dblfloat x, dblfloat y)
{
    dblfloat t, z;
    float sum;
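    t.y = x.y * y.y;              // head: rounded product of the two heads
    t.x = fmaf (x.y, y.y, -t.y);  // exact rounding error of the head product
    t.x = fmaf (x.x, y.x, t.x);   // accumulate tail * tail
    t.x = fmaf (x.y, y.x, t.x);   // accumulate the two head * tail cross terms
    t.x = fmaf (x.x, y.y, t.x);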
    /* normalize result */
    sum = t.y + t.x;
    z.x = (t.y - sum) + t.x;
    z.y = sum;
    return z;
}
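
For comparison with the multiplication above, here is a minimal sketch of a double-float addition. It is not the code from Thall's paper and not the answer author's code. It is the widely used formulation built from Knuth's two-sum and the fast two-sum, and it comes to 20 float additions/subtractions, in line with the estimate above. The names add_dblfloat, two_sum and quick_two_sum are chosen for this illustration only, and a plain struct stands in for float2 so that the sketch compiles as host C++ without the CUDA toolkit. Its error bound has not been measured here.

```cpp
// Host-side sketch. On a GPU one would use float2 and add
// __host__ __device__ qualifiers, as in the multiplication above.
struct dblfloat { float x; float y; };  // .y = head, .x = tail (mirrors float2)

// Two-sum: s + err equals a + b exactly (barring overflow),
// regardless of which operand is larger in magnitude. 6 operations.
static inline float two_sum(float a, float b, float &err)
{
    float s  = a + b;
    float bb = s - a;
    err = (a - (s - bb)) + (b - bb);
    return s;
}

// Fast two-sum: same property, but only valid when |a| >= |b|. 3 operations.
static inline float quick_two_sum(float a, float b, float &err)
{
    float s = a + b;
    err = b - (s - a);
    return s;
}

// Double-float addition: 6 + 6 + 1 + 3 + 1 + 3 = 20 float operations.
dblfloat add_dblfloat(dblfloat a, dblfloat b)
{
    float s1, s2, t1, t2, e;
    s1 = two_sum(a.y, b.y, s2);      // add heads, keep the rounding error
    t1 = two_sum(a.x, b.x, t2);      // add tails, keep the rounding error
    s2 += t1;
    s1 = quick_two_sum(s1, s2, e);   // renormalize
    s2 = e + t2;
    s1 = quick_two_sum(s1, s2, e);   // renormalize again
    dblfloat z;
    z.y = s1;                        // head
    z.x = e;                         // tail
    return z;
}
```

Sequences like these rely on the compiler not reassociating float arithmetic, so on the host they should not be built with options such as -ffast-math.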

Note that in various applications, full double-float arithmetic may not be necessary. Instead one can use float computation augmented by error-compensating techniques, one of the oldest of which is Kahan summation. I gave a brief overview of easily available literature on such methods in a recent posting in the NVIDIA developer forums. In the comments above, Robert Crovella also pointed to a GTC 2015 talk by Scott LeGrand, which I haven't had time to check out yet.
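
As an illustration of the error-compensating approach mentioned above, the following is a minimal host-side sketch of Kahan summation in float next to a plain float sum. The function names naive_sum and kahan_sum are chosen for this example and do not come from any library.

```cpp
#include <vector>

// Plain float accumulation: every addition rounds, and the errors pile up.
float naive_sum(const std::vector<float> &v)
{
    float sum = 0.0f;
    for (float x : v) sum += x;
    return sum;
}

// Kahan summation: carry the rounding error of each addition in c
// and feed it back into the next term.
float kahan_sum(const std::vector<float> &v)
{
    float sum = 0.0f;
    float c   = 0.0f;           // running compensation
    for (float x : v) {
        float y = x - c;        // apply the correction from the previous step
        float t = sum + y;      // low-order bits of y are lost here
        c = (t - sum) - y;      // recover what was lost (negated)
        sum = t;
    }
    return sum;
}
```

Summing one million copies of 0.1f this way, the compensated result stays close to 100000 while the plain sum drifts visibly away from it. As with other compensated arithmetic, the correction term only survives if the compiler is not allowed to reassociate float operations (for example, no -ffast-math on the host).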

As for accuracy, double-float has a representational precision of 49 (24+24+1) bits, compared with IEEE-754 double, which provides 53 bits. However, double-float cannot maintain this precision for operands small in magnitude, as the tail portion can become a denormal or zero. When denormal support is turned on, the 49 bits of precision are guaranteed for 2^-101 <= |x| < 2^128. Denormal support for float is turned on by default in the CUDA tool chain for architectures >= sm_20, which means all architectures supported by the currently shipping version, CUDA 7.0.
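
The loss of precision at small magnitudes can be seen with a small host-side sketch that splits a double into head and tail and converts it back. The helpers from_double, to_double and round_trip_error are hypothetical names introduced for this example, and the struct stands in for float2.

```cpp
#include <cmath>

struct dblfloat { float x; float y; };  // .y = head, .x = tail (mirrors float2)

// Split a double into a float head and a float tail.
dblfloat from_double(double d)
{
    dblfloat r;
    r.y = static_cast<float>(d);                              // head: d rounded to float
    r.x = static_cast<float>(d - static_cast<double>(r.y));   // tail: what the head missed
    return r;
}

// Recombine head and tail in double precision.
double to_double(dblfloat a)
{
    return static_cast<double>(a.y) + static_cast<double>(a.x);
}

// Absolute error introduced by storing d as a double-float.
double round_trip_error(double d)
{
    return std::fabs(to_double(from_double(d)) - d);
}
```

With these helpers, 1 + 2^-40 round-trips exactly, since head 1 and tail 2^-40 are both representable as floats. The same bit pattern scaled by 2^-120 does not: the head is 2^-120, but the tail would have to be 2^-160, which lies below the smallest float denormal (2^-149) and becomes zero. A value that needs all 53 bits, such as 1 + 2^-25 + 2^-52, does not round-trip either, because the tail only has 24 bits available.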

As opposed to operations on IEEE-754 double data, double-float operations are not correctly rounded. For the double-float multiplication above, using 2 billion random test cases (with all source operands and results within the bounds stated above), I observed an upper bound of 1.42e-14 for the relative error. I do not have data for the double-float addition, but its error bound should be similar.
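
A much smaller version of such an error measurement can be sketched on the host as follows. This is not the test setup used for the figure quoted above: it assumes operands drawn uniformly from [1, 2), uses the native double product as the reference (itself rounded, but with a relative error of about 1.1e-16, which is small next to the errors being measured), and copies the multiplication from above onto a plain struct. The names from_double, to_double and max_rel_error_mul are hypothetical.

```cpp
#include <cmath>
#include <random>

struct dblfloat { float x; float y; };  // .y = head, .x = tail (mirrors float2)

static dblfloat from_double(double d)
{
    dblfloat r;
    r.y = static_cast<float>(d);
    r.x = static_cast<float>(d - static_cast<double>(r.y));
    return r;
}

static double to_double(dblfloat a)
{
    return static_cast<double>(a.y) + static_cast<double>(a.x);
}

// Same operation sequence as the multiplication shown earlier.
static dblfloat mul_dblfloat(dblfloat x, dblfloat y)
{
    dblfloat t, z;
    t.y = x.y * y.y;
    t.x = std::fmaf(x.y, y.y, -t.y);
    t.x = std::fmaf(x.x, y.x, t.x);
    t.x = std::fmaf(x.y, y.x, t.x);
    t.x = std::fmaf(x.x, y.y, t.x);
    float sum = t.y + t.x;              // normalize result
    z.x = (t.y - sum) + t.x;
    z.y = sum;
    return z;
}

// Largest relative error seen over n random operand pairs in [1, 2).
double max_rel_error_mul(int n, unsigned seed)
{
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> dist(1.0, 2.0);
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        dblfloat a = from_double(dist(gen));
        dblfloat b = from_double(dist(gen));
        double ref = to_double(a) * to_double(b);      // double reference
        double got = to_double(mul_dblfloat(a, b));
        double rel = std::fabs(got - ref) / std::fabs(ref);
        if (rel > worst) worst = rel;
    }
    return worst;
}
```

Calling max_rel_error_mul with a few hundred thousand samples gives a quick sanity check on the order of magnitude of the error; a bound comparable to the one quoted above would need far more samples and operands spread over the full supported range.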
