c++ – Real numbers – how to determine whether float or double is required?

Exception or error:

Given a real value, can we check if a float data type is enough to store the number, or a double is required?

I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?

How to solve:

For background, see “What Every Computer Scientist Should Know About Floating-Point Arithmetic”.

Unfortunately, I don’t think there is any way to automate the decision.

Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.

In practice, most calculations retain enough precision for usable results with a 64-bit type. Many calculations will not produce usable results with only 32 bits.
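As a small illustration of how an intermediate result can lose information even when every input fits, the sketch below evaluates (1 + x) - 1 in both types. The helper names add_then_subtract and show_intermediate_rounding are made up for this example, and it assumes IEEE 754 float and double evaluated without excess precision.

```cpp
#include <iostream>

// Computes (1 + x) - 1 in type T, storing the intermediate sum in a T.
// Hypothetical helper, written only to illustrate intermediate rounding.
template <typename T>
T add_then_subtract(T x) {
    T sum = T(1) + x;   // intermediate result, rounded to T
    return sum - T(1);
}

// Prints the result for both types. x = 1e-8 is smaller than half the
// spacing between adjacent floats near 1, but not between adjacent doubles.
void show_intermediate_rounding() {
    std::cout << "float:  " << add_then_subtract(1e-8f) << '\n';
    std::cout << "double: " << add_then_subtract(1e-8) << '\n';
}
```

Under those assumptions the float version returns exactly 0, because 1 + 1e-8 rounds back to 1, while the double version returns a value close to 1e-8.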

In modern processors, buses and arithmetic units are wide enough to give 32-bit and 64-bit floating point similar performance. The main motivation for using a 32-bit type is to save space when storing a very large array.

That leads to the following strategy:

If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32-bit type gives good enough results, and if so use it. Otherwise, use a 64-bit type.
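One way such an experiment might look is sketched below: run the same computation with a float accumulator and a double accumulator, then compare the results against a tolerance. The summation is only a stand-in for the real algorithm, the names sum_as, float_relative_error and float_good_enough are hypothetical, and treating the double result as the reference is itself an assumption.

```cpp
#include <cmath>
#include <vector>

// Sums the data using accumulator type T.
// Stand-in for "the real algorithm"; replace with the computation of interest.
template <typename T>
T sum_as(const std::vector<double>& data) {
    T total = 0;
    for (double v : data) total += static_cast<T>(v);
    return total;
}

// Relative difference between the float result and the double result.
// The double result is treated as the reference (an assumption).
double float_relative_error(const std::vector<double>& data) {
    double ref = sum_as<double>(data);
    double approx = static_cast<double>(sum_as<float>(data));
    if (ref == 0.0) return std::fabs(approx);
    return std::fabs(approx - ref) / std::fabs(ref);
}

// True if the float computation stays within the caller's tolerance.
bool float_good_enough(const std::vector<double>& data, double tolerance) {
    return float_relative_error(data) <= tolerance;
}
```

For example, float_good_enough(std::vector<double>(1000000, 0.1), 1e-6) is expected to be false on IEEE 754 hardware, because rounding error accumulates in the float accumulator, while a short list of small integers passes even with a tolerance of zero. The tolerance itself has to come from the application.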

###

I think your question presupposes a way to specify any “real number” to C / C++ (or any other program) without precision loss.

Suppose that you get this real number by specifying it in code or through user input. One way to check whether a float or a double can store it without precision loss is to count its significant bits and compare that count against the significand width of float and double (24 and 53 bits for IEEE 754 types), and to confirm that its exponent lies within the range of the type.
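A minimal sketch of the bit-counting idea, assuming the value has already been read into a finite IEEE 754 double (so it only distinguishes float from double, not arbitrary reals). The names significant_bits and fits_float_significand are made up here, and the exponent range still has to be checked separately.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Number of significand bits needed to hold x exactly (0 for zero).
// Assumes x is finite and double is IEEE 754 binary64.
int significant_bits(double x) {
    if (x == 0.0) return 0;
    int exponent = 0;
    double m = std::frexp(std::fabs(x), &exponent);   // m is in [0.5, 1)
    // Scaling by 2^53 turns the significand into an exact integer.
    auto bits = static_cast<std::uint64_t>(std::ldexp(m, 53));
    int n = 53;
    while ((bits & 1u) == 0) {   // drop trailing zero bits
        bits >>= 1;
        --n;
    }
    return n;
}

// True if the significand of x fits in a float (24 bits for IEEE 754).
// Does not check the exponent range.
bool fits_float_significand(double x) {
    return significant_bits(x) <= std::numeric_limits<float>::digits;
}
```

With this sketch, 16777216.0 (2^24) needs a single significant bit and fits, whereas 16777217.0 (2^24 + 1) needs 25 bits and does not.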

If the number is given as an expression (e.g. 1/7 or sqrt(2)), you will also want ways of detecting:

Moreover, there are numbers, such as 0.9, that float / double cannot in theory represent “exactly” (at least not in our binary computation paradigm) – see Jon Skeet’s excellent answer on this.
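To see the 0.9 case concretely, one could print the stored value with more digits than the type can carry. The helper to_fixed below is a hypothetical convenience wrapper around std::snprintf, not part of any library, and the exact digits printed depend on the C library in use.

```cpp
#include <cstdio>
#include <string>

// Formats x with the requested number of digits after the decimal point.
// Hypothetical helper; the output is truncated if it exceeds the buffer.
std::string to_fixed(double x, int digits) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f", digits, x);
    return buf;
}

// Prints the value actually stored when 0.9 is held in each type.
void show_stored_values() {
    std::printf("double 0.9 : %s\n", to_fixed(0.9, 20).c_str());
    std::printf("float  0.9f: %s\n", to_fixed(0.9f, 20).c_str());
    std::printf("double 0.5 : %s\n", to_fixed(0.5, 20).c_str());
}
```

On an IEEE 754 platform the two 0.9 lines should differ from each other and from 0.9 followed by zeros, while 0.5, being a power of two, prints exactly.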

Lastly, see additional discussion on float vs. double.

###

Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.

Single precision assigns 23 bits to the “mantissa,” the binary digits after the radix point (the binary counterpart of the decimal point). Since the bit before the point is always one for normalized values, this equates to a 24-bit significand. Dividing by log2(10) ≈ 3.32, a float gets you about 7.2 decimal digits of precision.

Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).

The remaining bits, apart from the sign bit, are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~ 10^±38, double goes to ~ 10^±308.
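These figures need not be hard-coded; they can be read from std::numeric_limits for whatever platform the code is compiled on. The sketch below uses the standard members digits and max_exponent10; the wrapper names decimal_digits and max_decimal_exponent are made up for illustration. Multiplying by log10(2) is the same as dividing by log2(10).

```cpp
#include <cmath>
#include <limits>

// Approximate decimal digits carried by the significand of T:
// (number of binary digits) * log10(2).
template <typename T>
double decimal_digits() {
    return std::numeric_limits<T>::digits * std::log10(2.0);
}

// Largest n such that 10^n is a finite value of T.
template <typename T>
int max_decimal_exponent() {
    return std::numeric_limits<T>::max_exponent10;
}
```

Where float and double are IEEE 754 single and double precision, decimal_digits<float>() comes out near 7.2 and decimal_digits<double>() near 15.9, with maximum decimal exponents of 38 and 308. std::numeric_limits<T>::is_iec559 reports whether that assumption holds.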

As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that’s really outside the scope of the question. It depends on the algorithm and the application.

###

A very detailed post that may or may not answer your question.

An entire series in floating point complexities!

###

Couldn’t you simply store it in both a float and a double variable and then compare the two? The comparison should implicitly convert the float back to a double; if there is no difference, the float should be sufficient.

float f = value;
double d = value;
if ((double)f == d)
{
     // float is sufficient
}
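A slightly more defensive variant of the same round-trip idea is sketched below. It assumes the value arrives as a double, and it checks the magnitude before converting, because as far as I understand the C++ conversion rules, narrowing a double that lies outside the float range is not guaranteed to be well defined. The name float_is_sufficient and the handling of NaN and infinity are choices made for this example.

```cpp
#include <cmath>
#include <limits>

// True if converting value to float and back yields the same double.
// Hypothetical helper; "sufficient" here means exact round-trip only.
bool float_is_sufficient(double value) {
    // NaN and infinity exist in both types; treating them as
    // representable is a choice made for this sketch.
    if (std::isnan(value) || std::isinf(value)) return true;

    // Reject values beyond the float range before converting.
    if (std::fabs(value) > std::numeric_limits<float>::max()) return false;

    float f = static_cast<float>(value);
    return static_cast<double>(f) == value;
}
```

With IEEE 754 types, 0.5 and 16777216.0 pass this check, while 0.1, 16777217.0 and 1e300 do not. An exact round trip is a strict criterion; it says nothing about whether float is precise enough for the arithmetic that follows.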

###

You cannot represent arbitrary real numbers with float or double variables, only a subset of the rational numbers.

When you do floating-point computation, your CPU’s floating-point unit chooses the approximation for you by rounding each result to a representable value.

I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating-point representations were actually specified independently of computer architectures.
