Integers vs. Floating-point Numbers

Floating-point numbers offer some distinct advantages over integers: they can represent fractional values and cover a far wider range. Integers, however, are exact, while floating-point values are often only approximations. To see why, we will explore the way computers store and manipulate the integer and floating-point types.

Computers store all data internally in binary form. The decimal system uses 10 digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Figure 4.4 shows how the familiar base 10 place value system works.

  • The base 10 place value system
  • 473,406 = 4×10⁵ + 7×10⁴ + 3×10³ + 4×10² + 0×10¹ + 6×10⁰
  • = 400,000 + 70,000 + 3,000 + 400 + 0 + 6
  • = 473,406
  • The base 2 place value system
  • 100111₂ = 1×2⁵ + 0×2⁴ + 0×2³ + 1×2² + 1×2¹ + 1×2⁰
  • = 32 + 0 + 0 + 4 + 2 + 1
  • = 39

With only two digits to work with, the binary number system distinguishes place values by powers of two. Sometimes to be very clear we will attach a subscript of 10 to a decimal number, as in 100₁₀, and a subscript of 2 to a binary number, as in 100₂.

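In the decimal system, it is easy to add 3 + 5:

3 + 5 = 8

The sum 3 + 9 is a little more complicated, because the result does not fit in a single digit:

3 + 9 = 12

We can say 3 + 9 is 2, carry the 1. Writing the operands as 03 and 09, the ones column yields 2, and the carried 1 plus 0 + 0 in the tens column yields 1, giving 12.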

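Binary addition works the same way, except that a carry occurs as soon as a column's sum reaches 2. There are only four single-digit cases:

  • 0₂ + 0₂ = 0₂
  • 0₂ + 1₂ = 1₂
  • 1₂ + 0₂ = 1₂
  • 1₂ + 1₂ = 10₂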

Integer Implementation

Standard C++ supports multiple integer types: int, short, long, and long long, along with their unsigned counterparts unsigned, unsigned short, unsigned long, and unsigned long long. The most commonly used integer type in C++ is int. The exact number of bits in an int depends on the compiler and the target processor; 32 bits is typical on current desktop systems.

  • Binary Bit String Decimal Value
  • 00000 0
  • 00001 1
  • 00010 2
  • 00011 3
  • 00100 4
  • 00101 5
  • 00110 6
  • 00111 7
  • 01000 8
  • 01001 9
  • 01010 10
  • 01011 11
  • 01100 12
  • 01101 13
  • 01110 14
  • 01111 15
  • 10000 16
  • 10001 17
  • 10010 18
  • 10011 19
  • 10100 20
  • 10101 21
  • 10110 22
  • 10111 23
  • 11000 24
  • 11001 25
  • 11010 26
  • 11011 27
  • 11100 28
  • 11101 29
  • 11110 30
  • 11111 31

If the 32-bit unsigned integers are arranged around a circle, arithmetic simply moves around it. Adding 1 to 4,294,967,295 produces 0, one position clockwise from 4,294,967,295. Subtracting 4 from 2 yields 4,294,967,294, four places counterclockwise from 2.

Figure: The cyclic nature of 32-bit unsigned integers

Example

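#include <iostream>

int main() {
    // Assumes a 32-bit int, whose largest value is 2,147,483,647.
    int x = 2147483645; // Almost the largest possible int value
    std::cout << x << " + 1 = " << x + 1 << '\n';
    std::cout << x << " + 2 = " << x + 2 << '\n';
    // x + 3 exceeds the int range. Signed overflow is undefined behavior in
    // C++; many platforms happen to wrap around to -2147483648, but the
    // standard guarantees nothing here.
    std::cout << x << " + 3 = " << x + 3 << '\n';
}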

Floating-point Implementation

The standard C++ floating-point types consist of float, double, and long double. As with the integer types, the different floating-point types may be distinguished by the number of bits of storage required and the corresponding range of values. The type float stands for single-precision floating-point, and double stands for double-precision floating-point.

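On the common platforms that follow the IEEE 754 standard, single-precision floating-point numbers (type float) occupy 32 bits, distributed as follows:

  • Mantissa 23 bits (an implicit leading bit gives 24 bits of precision)
  • Exponent 8 bits
  • Sign 1 bit
  • Total 32 bits

Double-precision floating-point numbers (type double) require 64 bits:

  • Mantissa 52 bits (an implicit leading bit gives 53 bits of precision)
  • Exponent 11 bits
  • Sign 1 bit
  • Total 64 bits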

Code Example

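#include <iostream>
#include <iomanip>

int main() {
    // 2000.5 has an exact binary representation, so the difference is exact.
    double d1 = 2000.5;
    double d2 = 2000.0;
    std::cout << std::setprecision(16) << (d1 - d2) << '\n';
    // 2000.58 has no exact binary representation. The stored value is a
    // close approximation, and the subtraction exposes the error.
    double d3 = 2000.58;
    double d4 = 2000.0;
    std::cout << std::setprecision(16) << (d3 - d4) << '\n';
}
// Output
0.5
0.5799999999999272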

Code Example

#include <iostream>

int main() {
    // 1/5 cannot be represented exactly in binary floating point, so
    // subtracting it from 1 five times does not land exactly on zero.
    double one = 1.0,
           one_fifth = 1.0/5.0,
           zero = one - one_fifth - one_fifth - one_fifth - one_fifth - one_fifth;
    std::cout << "one = " << one << ", one_fifth = " << one_fifth
              << ", zero = " << zero << '\n';
}
// Output
one = 1, one_fifth = 0.2, zero = 5.55112e-017