Integers vs Floating-point Numbers
Floating-point numbers offer some distinct advantages over integers: they can represent fractional values and a much wider range of magnitudes. The price is exactness. To see why integers are exact and floating-point numbers are not, we will explore the way computers store and manipulate the integer and floating-point types.
Computers store all data internally in binary form. The decimal system uses 10 digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Figure 4.4 shows how the familiar base 10 place value system works.
- The base 10 place value system
- 473,406 = 4×10^5 + 7×10^4 + 3×10^3 + 4×10^2 + 0×10^1 + 6×10^0
- = 400,000 + 70,000 + 3,000 + 400 + 0 + 6
- = 473,406
- The base 2 place value system
- 100111_2 = 1×2^5 + 0×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 1×2^0
- = 32 + 0 + 0 + 4 + 2 + 1
- = 39
With only two digits to work with, the binary number system distinguishes place values by powers of two. Sometimes, to be very clear, we will attach a subscript of 10 to a decimal number, as in 100_10, and a subscript of 2 to a binary number, as in 100111_2.
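In the decimal system, it is easy to add 3 + 5:
3 + 5 = 8
The sum 3 + 9 is a little more complicated because the result no longer fits in a single digit:
3 + 9 = 12
We can say 3 + 9 is 2, carry the 1. The carried 1 lands in the tens column, so 03 + 09 = 12.
Binary addition works the same way, except that a column carries as soon as its sum reaches 2:
- 0_2 + 0_2 = 0_2
- 0_2 + 1_2 = 1_2
- 1_2 + 0_2 = 1_2
- 1_2 + 1_2 = 10_2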
Integer Implementation
Standard C++ supports multiple integer types: int, short, long, and long long, along with their unsigned counterparts unsigned, unsigned short, unsigned long, and unsigned long long. The most commonly used integer type in C++ is int. The exact number of bits in an int is processor specific.
- Binary bit string, decimal value (all 32 values of a 5-bit unsigned integer)
- 00000 0
- 00001 1
- 00010 2
- 00011 3
- 00100 4
- 00101 5
- 00110 6
- 00111 7
- 01000 8
- 01001 9
- 01010 10
- 01011 11
- 01100 12
- 01101 13
- 01110 14
- 01111 15
- 10000 16
- 10001 17
- 10010 18
- 10011 19
- 10100 20
- 10101 21
- 10110 22
- 10111 23
- 11000 24
- 11001 25
- 11010 26
- 11011 27
- 11100 28
- 11101 29
- 11110 30
- 11111 31
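With 32-bit unsigned integers, arithmetic wraps around as if the values 0 through 4,294,967,295 were arranged clockwise around a circle. Adding 1 to 4,294,967,295 produces 0, one position clockwise from 4,294,967,295. Subtracting 4 from 2 yields 4,294,967,294, four places counterclockwise from 2.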
Example
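#include <iostream>

int main() {
    int x = 2147483645;  // Almost the largest possible 32-bit int value (2,147,483,647)
    std::cout << x << " + 1 = " << x + 1 << '\n';
    std::cout << x << " + 2 = " << x + 2 << '\n';
    // x + 3 exceeds the largest int. Signed overflow is undefined behavior in C++;
    // many systems print -2147483648 here, but that result is not guaranteed.
    std::cout << x << " + 3 = " << x + 3 << '\n';
}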
Floating-point Implementation
The standard C++ floating-point types consist of float, double, and long double. As with the integer types, the different floating-point types may be distinguished by the number of bits of storage required and the corresponding range of values. The type float stands for single-precision floating-point, and double stands for double-precision floating-point.
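Single-precision floating-point numbers (type float) occupy 32 bits on systems that follow the IEEE 754 standard, distributed as follows:
- Mantissa 23 bits (24 bits of precision, counting the implicit leading 1 bit)
- Exponent 8 bits
- Sign 1 bit
- Total 32 bits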
Double-precision floating-point numbers (type double) require 64 bits:
- Mantissa 52 bits
- Exponent 11 bits
- Sign 1 bit
- Total 64 bits
Code Example
#include <iostream>
#include <iomanip>

int main() {
    // 0.5 is a power of two, so 2000.5 is stored exactly.
    double d1 = 2000.5;
    double d2 = 2000.0;
    std::cout << std::setprecision(16) << (d1 - d2) << '\n';
    // 0.58 has no finite binary expansion, so 2000.58 is only approximated.
    double d3 = 2000.58;
    double d4 = 2000.0;
    std::cout << std::setprecision(16) << (d3 - d4) << '\n';
}
// Output
0.5
0.5799999999999272
Code Example
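#include <iostream>

int main() {
    // 1/5 cannot be represented exactly in binary, so one_fifth is an approximation
    // and the small errors accumulate across the five subtractions.
    double one = 1.0,
           one_fifth = 1.0/5.0,
           zero = one - one_fifth - one_fifth - one_fifth - one_fifth - one_fifth;
    std::cout << "one = " << one << ", one_fifth = " << one_fifth
              << ", zero = " << zero << '\n';
}
// Output
one = 1, one_fifth = 0.2, zero = 5.55112e-017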