Computer Memory & Data Representation
Computers use a fixed number of bits to represent a piece of data, which could be a number, a character, or others. An n-bit storage location can represent up to 2^n distinct entities. For example, a 3-bit memory location can hold one of these eight binary patterns: 000B, 001B, 010B, 011B, 100B, 101B, 110B, or 111B. Hence, it can represent at most 8 distinct entities. You could use them to represent numbers 0 to 7, numbers 11 to 18, characters 'A' to 'H', or up to 8 kinds of fruits like apple, orange, banana, etc.
Integers, for example, can be represented in 8-bit, 16-bit, 32-bit or 64-bit. You, as the programmer, choose an appropriate bit-length for your integers. Your choice will impose a constraint on the range of integers that can be represented. Besides the bit-length, an integer can be represented in various representation schemes, e.g., unsigned vs. signed integers. An 8-bit signed integer has a range of -128 to 127, while an 8-bit unsigned integer has a range of 0 to 255.
It is important to note that a computer memory location merely stores a binary pattern. It is entirely up to you, as the programmer, to decide how to interpret these patterns. For example, the 8-bit binary pattern "0100 0001B" can be interpreted as the unsigned integer 65D, or the ASCII character 'A', or some secret information known only to you. In other words, you have to first decide how to represent a piece of data in a binary pattern before the binary patterns make sense. The interpretation of binary patterns is called data representation or encoding. Furthermore, it is important that the data representation schemes are agreed upon by all parties, i.e., industrial standards need to be formulated and strictly followed.
Once you have decided on the data representation scheme, certain constraints, in particular the precision and range, will be imposed. Hence, it is important to understand data representation to write correct and high-performance programs.
Integer Representation
Integers are whole numbers or fixed-point numbers with the radix point fixed after the least significant bit. They are in contrast to real numbers or floating-point numbers, where the position of the radix point varies. It is important to take note that integers and floating-point numbers are treated differently in computers. They have different representations and are processed differently (e.g., floating-point numbers are processed in a so-called floating-point processor). Floating-point numbers will be discussed later.
Computers use a fixed number of bits to represent an integer. The commonly-used bit-lengths for integers are 8-bit, 16-bit, 32-bit and 64-bit. Besides bit-lengths, there are two representation schemes for integers:
- Unsigned Integers: can represent zero and positive integers.
- Signed Integers: can represent zero, positive and negative integers. Three representation schemes have been proposed for signed integers:
  - Sign-Magnitude representation
  - 1's Complement representation
  - 2's Complement representation
You, as the programmer, need to decide on the bit-length and representation scheme for your numbers, depending on your application's requirements. Suppose that you need a counter for counting a small quantity from 0 up to 200; you might choose the 8-bit unsigned integer scheme, as there are no negative numbers involved.
n-bit Unsigned Integers
Unsigned integers can represent zero and positive integers, but not negative integers. The value of an unsigned integer is interpreted as "the magnitude of its underlying binary pattern".
Example 1: Suppose that n=8 and the binary pattern is 0100 0001B. The value of this unsigned integer is 1×2^6 + 1×2^0 = 65D.
Example 2: Suppose that n=16 and the binary pattern is 0001 0000 0000 1000B. The value of this unsigned integer is 1×2^12 + 1×2^3 = 4104D.
Example 3: Suppose that n=16 and the binary pattern is 0000 0000 0000 0000B. The value of this unsigned integer is 0.
An n-bit pattern can represent 2^n distinct integers. An n-bit unsigned integer can represent integers from 0 to (2^n)-1, as tabulated below:

n  | Minimum | Maximum
8  | 0       | (2^8)-1  (= 255)
16 | 0       | (2^16)-1 (= 65,535)
32 | 0       | (2^32)-1 (= 4,294,967,295) (9+ digits)
64 | 0       | (2^64)-1 (= 18,446,744,073,709,551,615) (19+ digits)
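Java has no unsigned integer primitives, but the same bit patterns can be viewed as unsigned through JDK 8 utility methods such as Byte.toUnsignedInt() and Integer.toUnsignedString(). A minimal sketch (the class name is mine, for illustration):

public class UnsignedDemo {
   public static void main(String[] args) {
      byte b = (byte) 0b10000001;                      // bit pattern 1000 0001B
      System.out.println(b);                           // -127 (signed interpretation)
      System.out.println(Byte.toUnsignedInt(b));       // 129 (unsigned interpretation)
      int i = -1;                                      // bit pattern of all 1's
      System.out.println(Integer.toUnsignedString(i)); // 4294967295, i.e., (2^32)-1
   }
}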
Signed Integers
Signed integers can represent zero, positive integers, as well as negative integers. Three representation schemes have been proposed for signed integers:
- Sign-Magnitude representation
- 1's Complement representation
- 2's Complement representation
In all the above three schemes, the most significant bit (msb) is called the sign bit, which is used to represent the sign of the integer, with 0 for positive integers and 1 for negative integers. The magnitude of the integer, however, is interpreted differently in each scheme.
n-bit Signed Integers in Sign-Magnitude Representation
In sign-magnitude representation:
- The most significant bit (msb) is the sign bit, with a value of 0 representing positive integers and 1 representing negative integers.
- The remaining n-1 bits represent the magnitude (absolute value) of the integer. The absolute value of the integer is interpreted as "the magnitude of the (n-1)-bit binary pattern".
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
  Sign bit is 0 ⇒ positive
  Absolute value is 100 0001B = 65D
  Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
  Sign bit is 1 ⇒ negative
  Absolute value is 000 0001B = 1D
  Hence, the integer is -1D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
  Sign bit is 0 ⇒ positive
  Absolute value is 000 0000B = 0D
  Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 000 0000B.
  Sign bit is 1 ⇒ negative
  Absolute value is 000 0000B = 0D
  Hence, the integer is -0D
Take note that there are two representations (0000 0000B and 1000 0000B) for the number zero. Furthermore, positive numbers and negative numbers need to be processed separately.
n-bit Signed Integers in 1's Complement Representation
In 1's complement representation:
- Again, the most significant bit (msb) is the sign bit, with a value of 0 representing positive integers and 1 representing negative integers.
- The remaining n-1 bits represent the magnitude of the integer, as follows:
  - for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern";
  - for negative integers, the absolute value of the integer is equal to "the magnitude of the complement (inverse) of the (n-1)-bit binary pattern" (hence called 1's complement).
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
  Sign bit is 0 ⇒ positive
  Absolute value is 100 0001B = 65D
  Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
  Sign bit is 1 ⇒ negative
  Absolute value is the complement of 000 0001B, i.e., 111 1110B = 126D
  Hence, the integer is -126D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
  Sign bit is 0 ⇒ positive
  Absolute value is 000 0000B = 0D
  Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
  Sign bit is 1 ⇒ negative
  Absolute value is the complement of 111 1111B, i.e., 000 0000B = 0D
  Hence, the integer is -0D
Again, take note that there are two representations (0000 0000B and 1111 1111B) for zero. Positive integers and negative integers also need to be processed separately.
n-bit Signed Integers in 2's Complement Representation
In 2's complement representation:
- Again, the most significant bit (msb) is the sign bit, with a value of 0 representing positive integers and 1 representing negative integers.
- The remaining n-1 bits represent the magnitude of the integer, as follows:
  - for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern";
  - for negative integers, the absolute value of the integer is equal to "the magnitude of the complement of the (n-1)-bit binary pattern plus one" (hence called 2's complement).
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
  Sign bit is 0 ⇒ positive
  Absolute value is 100 0001B = 65D
  Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
  Sign bit is 1 ⇒ negative
  Absolute value is the complement of 000 0001B plus 1, i.e., 111 1110B + 1B = 111 1111B = 127D
  Hence, the integer is -127D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
  Sign bit is 0 ⇒ positive
  Absolute value is 000 0000B = 0D
  Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
  Sign bit is 1 ⇒ negative
  Absolute value is the complement of 111 1111B plus 1, i.e., 000 0000B + 1B = 000 0001B = 1D
  Hence, the integer is -1D
Computers use 2's Complement Representation for Signed Integers
We have discussed three representations for signed integers: sign-magnitude, 1's complement and 2's complement. Computers use 2's complement to represent signed integers. This is because:
- There is only one representation for the number zero in 2's complement, instead of two representations in sign-magnitude and 1's complement.
- Positive and negative integers can be treated together in addition and subtraction. Subtraction can be carried out using the "addition logic" by wrapping around and discarding the carry bit.
Example 1 (Addition of two positive numbers): Suppose that n=8, 65D + 5D = 70D

   65D →  0100 0001B
    5D →  0000 0101B (+)
          0100 0110B → 70D (OK)

Example 2 (Subtraction of two positive numbers can be treated as the addition of a positive and a negative number): Suppose that n=8, 65D - 5D = 65D + (-5D) = 60D

   65D →  0100 0001B
   -5D →  1111 1011B (+)
          0011 1100B → 60D (discard carry - OK)

Example 3 (Addition of two negative numbers): Suppose that n=8, -65D - 5D = (-65D) + (-5D) = -70D

  -65D →  1011 1111B
   -5D →  1111 1011B (+)
          1011 1010B → -70D (discard carry - OK)
Because of the fixed precision (i.e., fixed number of bits), an n-bit 2's complement signed integer has a certain range. For example, for n=8, the range of 2's complement signed integers is -128 to +127. During addition, it is important to check whether the result exceeds this range, in other words, whether overflow or underflow has occurred.
Example 4 (Overflow): Suppose that n=8, 127D + 2D = 129D (overflow!)

  127D →  0111 1111B
    2D →  0000 0010B (+)
          1000 0001B → -127D (wrong)

Example 5 (Underflow): Suppose that n=8, -125D - 5D = -130D (underflow!)

 -125D →  1000 0011B
   -5D →  1111 1011B (+)
          0111 1110B → +126D (wrong)
The 2's complement scheme can be pictured by rearranging the number line into a circle: the values from -128 to +127 are represented contiguously, and the carry out of the most significant bit is simply ignored.
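You can observe this wrap-around behavior directly in Java, whose byte type is an 8-bit 2's complement integer. A minimal sketch (the class name is mine):

public class OverflowDemo {
   public static void main(String[] args) {
      byte a = 127, b = 2;
      // Arithmetic is carried out in 32-bit int; casting back to byte keeps only 8 bits
      System.out.println((byte) (a + b));   // -127: the overflow of Example 4
      byte c = -125, d = -5;
      System.out.println((byte) (c + d));   // +126: the underflow of Example 5
   }
}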
Range of n-bit 2's Complement Signed Integers
An n-bit 2's complement signed integer can represent integers from -2^(n-1) to +2^(n-1)-1, as tabulated below. Take note that the scheme can represent all the integers within the range, without any gap (i.e., there are no missing integers inside the supported range).
n  | Minimum | Maximum
8  | -(2^7)  (= -128)                       | +(2^7)-1  (= +127)
16 | -(2^15) (= -32,768)                    | +(2^15)-1 (= +32,767)
32 | -(2^31) (= -2,147,483,648)             | +(2^31)-1 (= +2,147,483,647) (9+ digits)
64 | -(2^63) (= -9,223,372,036,854,775,808) | +(2^63)-1 (= +9,223,372,036,854,775,807) (19 digits)
Decoding 2's Complement Numbers
1. Check the sign bit (denoted as S).
2. If S = 0, the number is positive, and its absolute value is the binary value of the remaining n-1 bits.
3. If S = 1, the number is negative. Scan the remaining n-1 bits from the right (least significant bit). Look for the first occurrence of 1. Flip all the bits to the left of that first occurrence of 1. The flipped pattern gives the absolute value. For example,

  n = 8, bit pattern = 1 100 0100B
  S = 1 → negative
  Scanning from the right and flipping all the bits to the left of the first occurrence of 1 ⇒ 011 1100B = 60D
  Hence, the value is -60D

Alternatively, you could also "invert all the bits and add 1" to get the magnitude of the negative number.
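The "invert all the bits and add 1" rule can be sketched in Java as follows; the helper name decodeTwosComplement is hypothetical, and the 8-bit pattern is passed as an int between 0 and 255:

public class TwosComplementDecoder {
   // Hypothetical helper: decode an 8-bit 2's complement pattern.
   static int decodeTwosComplement(int bits) {
      if ((bits & 0x80) == 0) {
         return bits;                  // sign bit is 0: the value is the pattern itself
      }
      return -(((~bits) + 1) & 0xFF);  // sign bit is 1: invert, add 1, keep 8 bits, negate
   }

   public static void main(String[] args) {
      System.out.println(decodeTwosComplement(0b11000100));  // -60 (the example above)
      System.out.println(decodeTwosComplement(0b01000001));  // 65
   }
}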
Exercise (Integer Representation)
1. What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integers, in "unsigned" and "signed" representations?
2. Give the values of +88, -88, -1, 0, +1, -128, and +127 in 8-bit signed representation.
Ans: (1) The range of an unsigned n-bit integer is [0, (2^n)-1]. The range of a signed n-bit integer is [-2^(n-1), +2^(n-1)-1]; (2) +88 (0101 1000), -88 (1010 1000), -1 (1111 1111), 0 (0000 0000), +1 (0000 0001), -128 (1000 0000), +127 (0111 1111).
Floating-Point Number Representation
A floating-point number (or real number) can represent a very large (1×10^50) or a very small (1×10^-50) value. It is typically expressed in scientific notation, with a fraction (F) and an exponent (E) of a certain radix (r), in the form of F × r^E.
The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, etc. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, the decimal number 123.4567 can be normalized as 1.234567×10^2; the binary number 1010.1011B can be normalized as 1.0101011B×2^3.
It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinitely many real numbers (even within a small range, say 0.0 to 1.0). On the other hand, an n-bit binary pattern can represent only a finite number (2^n) of distinct values. Hence, not all real numbers can be represented. The nearest approximation will be used instead, resulting in loss of accuracy.
It is also important to note that floating-point arithmetic is much less efficient than integer arithmetic (although it can be sped up with a so-called floating-point co-processor). Hence, use integers if your application does not require floating-point numbers.
In computers, floating-point numbers are represented in scientific notation with a fraction (F) and an exponent (E) of radix 2, in the form of F×2^E. Both E and F can be positive as well as negative. Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.
IEEE-754 32-bit Single-Precision Floating-Point Numbers
In 32-bit single-precision floating-point representation:
- The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
- The following 8 bits represent the exponent (E).
- The remaining 23 bits represent the fraction (F).
The value (N) is calculated as follows:
- For 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127). These numbers are in the so-called normalized form. The sign bit represents the sign of the number. The fractional part (1.F) is normalized with an implicit leading 1. The exponent is in bias (or excess) of 127, so as to represent both positive and negative exponents.
- For E = 0, N = (-1)^S × 0.F × 2^(-126). These numbers are in the so-called denormalized form. The exponent of 2^(-126) evaluates to a very small number. The denormalized form represents very small positive and negative numbers close to zero. The number zero is represented with E=0 and F=0.
- For E = 255, the pattern represents special values, such as ±INF (infinity) and NaN (not a number).
Example 1: Suppose that the IEEE-754 32-bit floating-point representation pattern is 0 10000000 110 0000 0000 0000 0000 0000B.

  Sign bit = 0 ⇒ positive number
  E = 1000 0000B = 128D (in normalized form)
  Fraction is 1.11B = 1 + 2^(-1) + 2^(-2) = 1.75D
  The number is +1.75 × 2^(128-127) = +3.5D

Example 2: Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 01111110 100 0000 0000 0000 0000 0000B.

  Sign bit = 1 ⇒ negative number
  E = 0111 1110B = 126D (in normalized form)
  Fraction is 1.1B = 1 + 2^(-1) = 1.5D
  The number is -1.5 × 2^(126-127) = -0.75D
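If you want to verify these two examples yourself, the JDK's Float.intBitsToFloat() builds a float from a raw 32-bit pattern; the hex literals below are simply the two example patterns written out. A minimal sketch:

public class Ieee754Demo {
   public static void main(String[] args) {
      // Example 1: 0 10000000 110 0000 0000 0000 0000 0000B = 4060 0000H
      System.out.println(Float.intBitsToFloat(0x40600000));   // 3.5
      // Example 2: 1 01111110 100 0000 0000 0000 0000 0000B = BF40 0000H
      System.out.println(Float.intBitsToFloat(0xBF400000));   // -0.75
      // The reverse direction: obtain the raw bit pattern of a float
      System.out.println(Integer.toHexString(Float.floatToIntBits(-0.75f)));  // bf400000
   }
}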
IEEE-754 64-bit Double-Precision Floating-Point Numbers
The representation scheme for 64-bit double-precision is similar to 32-bit single-precision:
- The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
- The following 11 bits represent the exponent (E).
- The remaining 52 bits represent the fraction (F).
The value (N) is calculated as follows:
- Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023).
- Denormalized form: For E = 0, N = (-1)^S × 0.F × 2^(-1022).
- For E = 2047, N represents special values, such as ±INF (infinity) and NaN (not a number).
More on Floating-Point Representation
There are three parts in a floating-point representation:
- The sign bit (S) is self-explanatory (0 for positive numbers and 1 for negative numbers).
- For the exponent (E), a so-called bias (or excess) is applied so as to represent both positive and negative exponents. The bias is set at half of the range. For single-precision with an 8-bit exponent, the bias is 127 (or excess-127). For double-precision with an 11-bit exponent, the bias is 1023 (or excess-1023).
- The fraction (F) (also called the mantissa or significand) is composed of an implicit leading bit (before the radix point) and the fractional bits (after the radix point). The leading bit for normalized numbers is 1, while the leading bit for denormalized numbers is 0.
Normalized Floating-Point Numbers
In normalized form, the radix point is placed after the first non-zero digit, e.g., 9.8765D×10^23D, 1.001011B×2^11B. For a binary number, the leading bit is always 1, and need not be represented explicitly; this saves 1 bit of storage.
In IEEE 754's normalized form:
- For single-precision, 1 ≤ E ≤ 254 with an excess of 127. Hence, the actual exponent ranges from -126 to +127. Negative exponents are used to represent small numbers (< 1.0), while positive exponents are used to represent large numbers (> 1.0).
    N = (-1)^S × 1.F × 2^(E-127)
- For double-precision, 1 ≤ E ≤ 2046 with an excess of 1023. The actual exponent ranges from -1022 to +1023, and
    N = (-1)^S × 1.F × 2^(E-1023)
Take note that an n-bit pattern has a finite number of combinations (= 2^n), and hence can represent only finitely many distinct numbers. It is not possible to represent the infinitely many numbers on the real axis (even a small range, say 0.0 to 1.0, contains infinitely many numbers). That is, not all floating-point numbers can be accurately represented. Instead, the closest approximation is used, which leads to loss of accuracy.
The minimum and maximum normalized floating-point numbers are:

Single-precision:
  Normalized N(min) = 0080 0000H (0 00000001 00000000000000000000000B), i.e., E = 1, F = 0:
    N(min) = 1.0B × 2^(-126) (≈ 1.17549435 × 10^(-38))
  Normalized N(max) = 7F7F FFFFH (0 11111110 11111111111111111111111B), i.e., E = 254, F = all 1's:
    N(max) = 1.1...1B × 2^127 = (2 - 2^(-23)) × 2^127 (≈ 3.4028235 × 10^38)
Double-precision:
  Normalized N(min) = 0010 0000 0000 0000H:
    N(min) = 1.0B × 2^(-1022) (≈ 2.2250738585072014 × 10^(-308))
  Normalized N(max) = 7FEF FFFF FFFF FFFFH:
    N(max) = 1.1...1B × 2^1023 = (2 - 2^(-52)) × 2^1023 (≈ 1.7976931348623157 × 10^308)
Denormalized Floating-Point Numbers
If E = 0, but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:
- For single-precision, E = 0: N = (-1)^S × 0.F × 2^(-126)
- For double-precision, E = 0: N = (-1)^S × 0.F × 2^(-1022)
The denormalized form can represent very small numbers close to zero, and zero itself, which cannot be represented in normalized form.
The minimum and maximum denormalized floating-point numbers are:

Single-precision:
  Denormalized D(min) = 0000 0001H (0 00000000 00000000000000000000001B), i.e., E = 0, F = 00000000000000000000001B:
    D(min) = 0.0...1B × 2^(-126) = 1 × 2^(-23) × 2^(-126) = 2^(-149) (≈ 1.4 × 10^(-45))
  Denormalized D(max) = 007F FFFFH (0 00000000 11111111111111111111111B), i.e., E = 0, F = all 1's:
    D(max) = 0.1...1B × 2^(-126) = (1 - 2^(-23)) × 2^(-126) (≈ 1.1754942 × 10^(-38))
Double-precision:
  Denormalized D(min) = 0000 0000 0000 0001H:
    D(min) = 0.0...1B × 2^(-1022) = 1 × 2^(-52) × 2^(-1022) = 2^(-1074) (≈ 4.9 × 10^(-324))
  Denormalized D(max) = 000F FFFF FFFF FFFFH:
    D(max) = 0.1...1B × 2^(-1022) = (1 - 2^(-52)) × 2^(-1022) (≈ 2.2250738585072009 × 10^(-308))
Note for Java users: You can use the JDK methods Float.intBitsToFloat(int bits) and Double.longBitsToDouble(long bits) to create a single-precision float or a double-precision double with a specific bit pattern, and obtain the above values. For example,

  System.out.println(Float.intBitsToFloat(0x7fffff));             // single-precision D(max) ≈ 1.1754942E-38
  System.out.println(Double.longBitsToDouble(0xfffffffffffffL));  // double-precision D(max) ≈ 2.225073858507201E-308
Special Values
Zero: Zero cannot be represented in normalized form, and must be represented in denormalized form with E=0 and F=0.

Infinity: The values +infinity (e.g., 1/0) and -infinity (e.g., -1/0) are represented with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision) and a fraction of all 0's.

Not a Number (NaN): NaN denotes a value that cannot be represented as a real number (e.g., 0/0). NaN is represented with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision) and a non-zero fraction.
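These special values can be produced and inspected in Java; a small sketch (Float.floatToIntBits() returns the underlying IEEE-754 bit pattern):

public class SpecialValuesDemo {
   public static void main(String[] args) {
      System.out.println(1.0f / 0.0f);    // Infinity
      System.out.println(-1.0f / 0.0f);   // -Infinity
      System.out.println(0.0f / 0.0f);    // NaN
      // +INF has S=0, E=255 (all 1's) and F=0, i.e., pattern 7F80 0000H:
      System.out.println(Integer.toHexString(Float.floatToIntBits(Float.POSITIVE_INFINITY)));  // 7f800000
   }
}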
Character Encoding
In computer memory, characters are "encoded" (or "represented") using a chosen "character encoding scheme" (aka "character set", "charset", "character map", or "code page").
For example, in ASCII (as well as Latin-1, Unicode, and many other character sets):
- code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z', respectively;
- code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z', respectively;
- code numbers 48D (30H) to 57D (39H) represent '0' to '9', respectively.
It is important to note that the representation scheme must be known before a binary pattern can be interpreted. E.g., the 8-bit pattern "0100 0010B" could represent anything under the sun known only to the person who encoded it.
The most commonly-used character encoding schemes are: 7-bit ASCII (ISO/IEC 646) and 8-bit Latin-x (ISO/IEC 8859-x) for western European characters, and Unicode (ISO/IEC 10646) for internationalization (i18n). A 7-bit encoding scheme (such as ASCII) can represent 128 characters and symbols. An 8-bit character encoding scheme (such as Latin-x) can represent 256 characters and symbols, whereas a 16-bit encoding scheme (such as Unicode UCS-2) can represent 65,536 characters and symbols.
7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)
- ASCII (American Standard Code for Information Interchange) is one of the earlier character encoding schemes.
- ASCII was originally a 7-bit code. It has been extended to 8-bit to better utilize the 8-bit computer memory organization. (The 8th bit was originally used for parity checking in early computers.)
- Code numbers 32D (20H) to 126D (7EH) are printable (displayable) characters, as tabulated:

       0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
  2   SP  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
  3    0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
  4    @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
  5    P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
  6    `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
  7    p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~
  - Code number 32D (20H) is the blank or space character.
  - '0' to '9': 30H-39H (0011 0000B to 0011 1001B), or (0011 xxxxB, where xxxx is the equivalent integer value).
  - 'A' to 'Z': 41H-5AH (0100 0001B to 0101 1010B), or (010x xxxxB). 'A' to 'Z' are contiguous, without gap.
  - 'a' to 'z': 61H-7AH (0110 0001B to 0111 1010B), or (011x xxxxB). 'a' to 'z' are also contiguous, without gap. However, there is a gap between the uppercase and lowercase letters. To convert between upper and lowercase, flip the value of bit-5 (see the code sketch after the control-character table below).
- Code numbers 0D (00H) to 31D (1FH), and 127D (7FH), are special control characters, which are non-printable (non-displayable), as tabulated below. Many of these characters were used in the early days for transmission control (e.g., STX, ETX) and printer control (e.g., Form-Feed), and are now obsolete. The remaining meaningful codes today are:
  - 09H for Tab ('\t');
  - 0AH for Line-Feed or newline (LF, '\n') and 0DH for Carriage-Return (CR, '\r'), which are used as the line delimiter (aka line separator, end-of-line) for text files. There is unfortunately no standard for the line delimiter: Unixes use 0AH ("\n"), Windows uses 0D0AH ("\r\n"), and Macs use 0DH ("\r"). Programming languages such as C/C++/Java (which were created on Unix) use 0AH ("\n").
  - In programming languages such as C/C++/Java, line-feed (0AH) is denoted as '\n', carriage-return (0DH) as '\r', and tab (09H) as '\t'.
DEC | HEX | Abbr | Meaning              | DEC | HEX | Abbr | Meaning
  0 | 00  | NUL  | Null                 |  17 | 11  | DC1  | Device Control 1
  1 | 01  | SOH  | Start of Heading     |  18 | 12  | DC2  | Device Control 2
  2 | 02  | STX  | Start of Text        |  19 | 13  | DC3  | Device Control 3
  3 | 03  | ETX  | End of Text          |  20 | 14  | DC4  | Device Control 4
  4 | 04  | EOT  | End of Transmission  |  21 | 15  | NAK  | Negative Ack.
  5 | 05  | ENQ  | Enquiry              |  22 | 16  | SYN  | Sync. Idle
  6 | 06  | ACK  | Acknowledgment       |  23 | 17  | ETB  | End of Transmission Block
  7 | 07  | BEL  | Bell                 |  24 | 18  | CAN  | Cancel
  8 | 08  | BS   | Back Space '\b'      |  25 | 19  | EM   | End of Medium
  9 | 09  | HT   | Horizontal Tab '\t'  |  26 | 1A  | SUB  | Substitute
 10 | 0A  | LF   | Line Feed '\n'       |  27 | 1B  | ESC  | Escape
 11 | 0B  | VT   | Vertical Tab         |  28 | 1C  | IS4  | File Separator
 12 | 0C  | FF   | Form Feed '\f'       |  29 | 1D  | IS3  | Group Separator
 13 | 0D  | CR   | Carriage Return '\r' |  30 | 1E  | IS2  | Record Separator
 14 | 0E  | SO   | Shift Out            |  31 | 1F  | IS1  | Unit Separator
 15 | 0F  | SI   | Shift In             |     |     |      |
 16 | 10  | DLE  | Datalink Escape      | 127 | 7F  | DEL  | Delete
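As mentioned above, uppercase and lowercase letters differ only in bit-5, so case conversion is a one-bit flip. A minimal sketch (the class name is mine):

public class CaseFlipDemo {
   public static void main(String[] args) {
      char upper = 'A';                          // 41H = 0100 0001B
      char lower = (char) (upper ^ 0x20);        // flip bit-5 (0010 0000B)
      System.out.println(lower);                 // a (61H = 0110 0001B)
      System.out.println((char) ('z' ^ 0x20));   // Z
   }
}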
8-bit Latin-1 (aka ISO/IEC 8859-1)
ISO/IEC 8859 is a collection of 8-bit character encoding standards for western languages.
ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used encoding scheme for western European languages. It has 191 printable characters from the Latin script, which covers languages like English, German, Italian, Portuguese and Spanish. Latin-1 is backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)) are the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned. Code numbers 160 (A0H) to 255 (FFH) are given as follows:
      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
A   NBSP   ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬   SHY    ®    ¯
B    °     ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
C    À     Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
D    Ð     Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
E    à     á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
F    ð     ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ
ISO/IEC 8859 has 16 parts. Besides the most commonly-used Part 1, Part 2 is meant for Central European (Polish, Czech, Hungarian, etc.), Part 3 for South European (Turkish, etc.), Part 4 for North European (Estonian, Latvian, etc.), Part 5 for Cyrillic, Part 6 for Arabic, Part 7 for Greek, Part 8 for Hebrew, Part 9 for Turkish, Part 10 for Nordic, Part 11 for Thai, Part 12 was abandoned, Part 13 for Baltic Rim, Part 14 for Celtic, Part 15 for French, Finnish, etc., and Part 16 for South-Eastern European.
Other 8-bit Extensions of US-ASCII (ASCII Extensions)
Besides the standardized ISO-8859-x, there are many 8-bit ASCII extensions, which are not compatible with each other.
ANSI (aka Windows-1252, or Windows Codepage 1252): for Latin alphabets used in the legacy DOS/Windows systems. It is a superset of ISO-8859-1 with code numbers 128 (80H) to 159 (9FH) assigned to displayable characters, such as "smart" single-quotes and double-quotes. A common problem in web browsers is that all the quotes and apostrophes (produced by "smart quotes" in some Microsoft software) are replaced with question marks or strange symbols. This is because the document is labelled as ISO-8859-1 (instead of Windows-1252), where these code numbers are undefined. Most modern browsers and e-mail clients treat the charset ISO-8859-1 as Windows-1252 in order to accommodate such mislabelling.
      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
8    €          ‚    ƒ    „    …    †    ‡    ˆ    ‰    Š    ‹    Œ         Ž
9          ‘    ’    “    ”    •    –    —    ˜    ™    š    ›    œ         ž    Ÿ
EBCDIC (Extended Binary Coded Decimal Interchange Code): Used in the early IBM computers.
Unicode (aka ISO/IEC 10646 Universal Character Set)
Before Unicode, no single character encoding scheme could represent characters in all languages. For example, western European languages use several encoding schemes (in the ISO-8859-x family). Even a single language like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many encoding schemes are in conflict with each other, i.e., the same code number is assigned to different characters.
Unicode aims to provide a standard character encoding scheme, which is universal, efficient, uniform and unambiguous. The Unicode standard is maintained by a non-profit organization called the Unicode Consortium (@ www.unicode.org). Unicode is an ISO/IEC standard, 10646.
Unicode is backward compatible with 7-bit US-ASCII and 8-bit Latin-1 (ISO-8859-1). That is, the first 128 characters are the same as US-ASCII, and the first 256 characters are the same as Latin-1.
Unicode originally used 16 bits (called UCS-2 or Unicode Character Set - 2 bytes), which can represent up to 65,536 characters. It has since been expanded beyond 16 bits, and currently stands at 21 bits. The range of legal codes in ISO/IEC 10646 is now from U+0000H to U+10FFFFH (21 bits, or about 2 million characters), covering all current and ancient historical scripts. The original 16-bit range of U+0000H to U+FFFFH (65,536 characters) is known as the Basic Multilingual Plane (BMP), covering all the major languages in current use. The characters outside the BMP are called Supplementary Characters, and are not frequently used.
Unicode has two encoding schemes:
- UCS-2 (Universal Character Set - 2 bytes): uses 2 bytes (16 bits), covering the 65,536 characters in the BMP. The BMP is sufficient for most applications. UCS-2 is now obsolete.
- UCS-4 (Universal Character Set - 4 bytes): uses 4 bytes (32 bits), covering the BMP and the supplementary characters.
UTF-8 (Unicode Transformation Format - 8-bit)
The 16/32-bit Unicode (UCS-2/4) is grossly inefficient if a document contains mainly ASCII characters, because each character occupies two (or four) bytes of storage. Variable-length encoding schemes, such as UTF-8, which uses 1-4 bytes to represent a character, were devised to improve efficiency. In UTF-8, the 128 commonly-used US-ASCII characters use only 1 byte, but some less-commonly-used characters may require up to 4 bytes. Overall, the efficiency is improved for documents containing mainly US-ASCII text.
The transformation between Unicode and UTF-8 is as follows:

Bits | Unicode                    | UTF-8 Code                          | Bytes
  7  | 00000000 0xxxxxxx          | 0xxxxxxx                            | 1 (ASCII)
 11  | 00000yyy yyxxxxxx          | 110yyyyy 10xxxxxx                   | 2
 16  | zzzzyyyy yyxxxxxx          | 1110zzzz 10yyyyyy 10xxxxxx          | 3
 21  | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx | 4
In UTF-8, Unicode numbers corresponding to the 7-bit ASCII characters are padded with a leading zero, and thus have the same values as in ASCII. Hence, UTF-8 can be used with all software that uses ASCII. Unicode numbers of 128 and above, which are less frequently used, are encoded using more bytes (2-4 bytes). UTF-8 generally requires less storage and is compatible with ASCII. The drawback of UTF-8 is that more processing power is needed to unpack the code, due to its variable length. UTF-8 is the most popular format for Unicode.
Notes:
- UTF-8 uses 1-3 bytes for the characters in the BMP (16-bit), and 4 bytes for supplementary characters outside the BMP (21-bit).
- The 128 ASCII characters (basic Latin letters, digits, and punctuation signs) use one byte. Most European and Middle Eastern characters use a 2-byte sequence, which includes extended Latin letters (with tilde, macron, acute, grave and other accents), Greek, Armenian, Hebrew, Arabic, and others. Chinese, Japanese and Korean (CJK) characters use three-byte sequences.
- All the bytes, except those of the 128 ASCII characters, have a leading '1' bit. In other words, the ASCII bytes, with a leading '0' bit, can be identified and decoded easily.
Example: 您好 (Unicode: 60A8H 597DH)

  Unicode (UCS-2) is 60A8H = 0110 000010 101000B
  ⇒ UTF-8 is 11100110 10000010 10101000B = E6 82 A8H
  Unicode (UCS-2) is 597DH = 0101 100101 111101B
  ⇒ UTF-8 is 11100101 10100101 10111101B = E5 A5 BDH
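You can check this encoding in Java with String.getBytes() and the standard UTF-8 charset; a minimal sketch (the "& 0xFF" masks each byte to its unsigned value for printing):

import java.nio.charset.StandardCharsets;

public class Utf8Demo {
   public static void main(String[] args) {
      byte[] utf8 = "您好".getBytes(StandardCharsets.UTF_8);
      for (byte b : utf8) {
         System.out.printf("%02X ", b & 0xFF);   // prints: E6 82 A8 E5 A5 BD
      }
      System.out.println();
   }
}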
UTF-16 (Unicode Transformation Format - 16-bit)
UTF-16 is a variable-length Unicode character encoding scheme, which uses 2 or 4 bytes. UTF-16 is not commonly used. The transformation table is as follows:
Unicode                              | UTF-16 Code                         | Bytes
xxxxxxxx xxxxxxxx                    | Same as UCS-2 (no encoding)         | 2
000uuuuu zzzzyyyy yyxxxxxx (uuuuu≠0) | 110110ww wwzzzzyy 110111yy yyxxxxxx | 4
                                     | (where uuuuu = wwww + 1)            |
Take note that for the 65,536 characters in the BMP, UTF-16 is the same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary characters outside the BMP: each such character requires a pair of 16-bit values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
UTF-32 (Unicode Transformation Format - 32-bit)
Same as UCS-4: it uses 4 bytes for each character, unencoded.
Formats of Multi-Byte (e.g., Unicode) Text Files
Endianness (or byte-order): For a multi-byte character, you need to take care of the order of the bytes in storage. In big-endian, the most significant byte is stored at the memory location with the lowest address (big byte first). In little-endian, the most significant byte is stored at the memory location with the highest address (little byte first). For example, 您 (with Unicode number 60A8H) is stored as 60 A8 in big-endian, and stored as A8 60 in little-endian. Big-endian, which produces a more readable hex dump, is more commonly used, and is often the default.
BOM (Byte Order Mark): The BOM is a special Unicode character with code number FEFFH, which is used to differentiate big-endian from little-endian. For big-endian, the BOM appears as FE FFH in the storage. For little-endian, the BOM appears as FF FEH. Unicode reserves these two code numbers to prevent them from clashing with other characters.
Unicode text files can take on these formats:
- Big-endian: UCS-2BE, UTF-16BE, UTF-32BE.
- Little-endian: UCS-2LE, UTF-16LE, UTF-32LE.
- UTF-16 with BOM: the first character of the file is a BOM character, which specifies the endianness. For big-endian, the BOM appears as FE FFH in the storage. For little-endian, the BOM appears as FF FEH.
A UTF-8 file is byte-oriented, so its byte order is fixed by the encoding itself and the BOM plays no part in determining endianness. However, on some systems (in particular Windows), a BOM is added as the first character of a UTF-8 file as a signature to identify the file as UTF-8 encoded. The BOM character (FEFFH) is encoded in UTF-8 as EF BB BF. Adding a BOM as the first character of the file is not recommended, as it may be incorrectly interpreted on other systems. You can have a UTF-8 file without a BOM.
Formats of Text Files
Line Delimiter or End-Of-Line (EOL): Sometimes, when you use the Windows NotePad to open a text file (created on Unix or Mac), all the lines are joined together. This is because different operating platforms use different characters as the so-called line delimiter (or end-of-line, or EOL). Two non-printable control characters are involved: 0AH (Line-Feed or LF) and 0DH (Carriage-Return or CR).
- Windows/DOS uses 0D0AH (CR+LF, "\r\n") as the EOL.
- Unixes use 0AH (LF, "\n") only.
- Macs use 0DH (CR, "\r") only.
EndofFile (EOF): [TODO]
Windows' CMD Codepage
The character encoding scheme (charset) in Windows is called the codepage. In the CMD shell, you can issue the command "chcp" to display the current codepage, or "chcp codepage-number" to change the codepage.
Take note that:
- The default codepage 437 (used in the original DOS) is an 8-bit character set called Extended ASCII, which is different from Latin-1 for code numbers above 127.
- Codepage 1252 (Windows-1252) is not exactly the same as Latin-1. It assigns code numbers 80H to 9FH to letters and punctuation, such as the "smart" single-quotes and double-quotes. A common problem in browsers, where quotes and apostrophes display as question marks or boxes, occurs because the page is actually Windows-1252 but mislabelled as ISO-8859-1.
- For internationalization and Chinese character sets: codepage 65001 for UTF-8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE, codepage 936 for Chinese characters in GB2312, and codepage 950 for Chinese characters in Big5.
Chinese Character Sets
Unicode supports all languages, including Asian languages like Chinese (both simplified and traditional characters), Japanese and Korean (collectively called CJK). There are more than 20,000 CJK characters in Unicode. Unicode characters are often encoded in the UTF-8 scheme, which, unfortunately, requires 3 bytes for each CJK character, instead of 2 bytes in the unencoded UCS-2 (UTF-16).
Worse still, there are also various Chinese character sets, which are not compatible with Unicode:
- GB2312/GBK: for simplified Chinese characters. GB2312 uses 2 bytes for each Chinese character. The most significant bit (MSB) of both bytes is set to 1 to co-exist with 7-bit ASCII, whose MSB is 0. There are about 6,700 characters. GBK is an extension of GB2312, which includes more characters as well as traditional Chinese characters.
- BIG5: for traditional Chinese characters. BIG5 also uses 2 bytes for each Chinese character, and the most significant bit of both bytes is also set to 1. BIG5 is not compatible with GBK, i.e., the same code number is assigned to different characters.
For example, the world is made more interesting with these many standards:

            Standard | Characters | Codes
Simplified  GB2312   | 和谐       | BACD D0B3
            UCS-2    | 和谐       | 548C 8C10
            UTF-8    | 和谐       | E5928C E8B090
Traditional BIG5     | 和諧       | A94D BFD3
            UCS-2    | 和諧       | 548C 8AE7
            UTF-8    | 和諧       | E5928C E8ABA7
Notes for Windows' CMD users: To display Chinese characters correctly in the CMD shell, you need to choose the correct codepage, e.g., 65001 for UTF-8, 936 for GB2312/GBK, 950 for Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, or 437 for the original DOS. You can use the command "chcp" to display the current codepage and "chcp codepage_number" to change the codepage. You also have to choose a font that can display the characters (e.g., Courier New, Consolas or Lucida Console, NOT a Raster font).
Collating Sequences (for Ranking Characters)
A string consists of a sequence of characters in upper or lower case, e.g., "apple", "BOY", "Cat". In sorting or comparing strings, if we order the characters according to the underlying code numbers (e.g., US-ASCII) character-by-character, the order for the example would be "BOY", "Cat", "apple", because uppercase letters have smaller code numbers than lowercase letters. This does not agree with the so-called dictionary order, where the same uppercase and lowercase letters have the same rank. Another common problem in ordering strings is that "10" (ten) is sometimes ordered in front of "1" to "9".
Hence, in sorting or comparison of strings, a so-called collating sequence (or collation) is often defined, which specifies the ranks for letters (uppercase, lowercase), numbers, and special symbols. There are many collating sequences available. It is entirely up to you to choose a collating sequence to meet your application's specific requirements. Some case-insensitive dictionary-order collating sequences give the same rank to the same uppercase and lowercase letters, i.e., 'A', 'a' ⇒ 'B', 'b' ⇒ ... ⇒ 'Z', 'z'. Some case-sensitive dictionary-order collating sequences put an uppercase letter before its lowercase counterpart, i.e., 'A' ⇒ 'B' ⇒ 'C' ⇒ ... ⇒ 'Z' ⇒ 'a' ⇒ 'b' ⇒ ... ⇒ 'z'. Typically, space is ranked before the digits '0' to '9', followed by the alphabets.
Collating sequence is often language dependent, as different languages use different sets of characters (e.g., á, é, a, α) with their own orders.
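In Java, locale-sensitive collation is provided by java.text.Collator, which compares strings in dictionary order rather than by raw code numbers. A minimal sketch (the class name is mine):

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorDemo {
   public static void main(String[] args) {
      String[] byCode = {"BOY", "apple", "Cat"};
      Arrays.sort(byCode);                             // code-number order
      System.out.println(Arrays.toString(byCode));     // [BOY, Cat, apple]

      String[] byDictionary = {"BOY", "apple", "Cat"};
      Arrays.sort(byDictionary, Collator.getInstance(Locale.ENGLISH));
      System.out.println(Arrays.toString(byDictionary));  // [apple, BOY, Cat]
   }
}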
For Java Programmers - java.nio.charset
JDK 1.4 introduced the java.nio.charset package to support the encoding/decoding of characters between the UCS-2 used internally in Java programs and any supported charset used by external devices.
Example: The following program encodes some Unicode text in various encoding schemes, and displays the hex codes of the encoded byte sequences.
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class TestCharsetEncode {
   public static void main(String[] args) {
      // Canonical charset names (the dashes matter for Charset.forName())
      String[] charsetNames = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16",
                               "UTF-16BE", "UTF-16LE", "GBK", "Big5"};
      String message = "Hi,您好!";

      // Print the internal UCS-2 (UTF-16) code units of the string
      System.out.printf("%10s: ", "UCS-2");
      for (int i = 0; i < message.length(); i++) {
         System.out.printf("%04X ", (int) message.charAt(i));
      }
      System.out.println();

      // Encode the string into each charset and print the byte sequence
      for (String charsetName : charsetNames) {
         Charset charset = Charset.forName(charsetName);
         System.out.printf("%10s: ", charset.name());
         ByteBuffer bb = charset.encode(message);
         while (bb.hasRemaining()) {
            System.out.printf("%02X ", bb.get());
         }
         System.out.println();
      }
   }
}
     UCS-2: 0048 0069 002C 60A8 597D 0021
  US-ASCII: 48 69 2C 3F 3F 21
ISO-8859-1: 48 69 2C 3F 3F 21
     UTF-8: 48 69 2C E6 82 A8 E5 A5 BD 21
    UTF-16: FE FF 00 48 00 69 00 2C 60 A8 59 7D 00 21
  UTF-16BE: 00 48 00 69 00 2C 60 A8 59 7D 00 21
  UTF-16LE: 48 00 69 00 2C 00 A8 60 7D 59 21 00
       GBK: 48 69 2C C4 FA BA C3 21
      Big5: 48 69 2C B1 7A A6 6E 21

(The characters 您 and 好, which are not supported in US-ASCII and ISO-8859-1, are substituted with the replacement character 3FH, i.e., '?'.)
For Java Programmers - char and String
The char data type is based on the original 16-bit Unicode standard, called UCS-2. Unicode has since evolved to 21 bits, with a code range of U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is known as the Basic Multilingual Plane (BMP). Characters above U+FFFF are called supplementary characters. A 16-bit Java char cannot hold a supplementary character.
Recall that in the UTF-16 encoding scheme, a BMP character uses 2 bytes, the same as UCS-2. A supplementary character uses 4 bytes and requires a pair of 16-bit values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
In Java, a String is a sequence of Unicode characters. Java, in fact, uses UTF-16 for String and StringBuffer. For BMP characters, they are the same as UCS-2. For supplementary characters, each character requires a pair of char values.
Java methods that accept a 16-bit char value do not support supplementary characters. Methods that accept a 32-bit int value support all Unicode characters (in the lower 21 bits), including supplementary characters.
This is meant to be an academic discussion. I have yet to encounter the use of supplementary characters!
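For completeness, a small sketch with one supplementary character (U+1F600 is used purely as an example of a code point above U+FFFF):

public class SupplementaryDemo {
   public static void main(String[] args) {
      // U+1F600 lies outside the BMP, so it needs a surrogate pair in UTF-16
      String s = new String(Character.toChars(0x1F600));
      System.out.println(s.length());                        // 2 char values (the surrogate pair)
      System.out.println(s.codePointCount(0, s.length()));   // 1 Unicode character
      System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));  // D83D DE00
   }
}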
Displaying Hex Values & Hex Editors
At times, you may need to display the hex values of a file, especially when dealing with Unicode characters. A Hex Editor is a handy tool that a good programmer should possess in his/her toolbox. There are many freeware/shareware Hex Editors available. Try googling "Hex Editor".
I have used the following:
- NotePad++ with Hex Editor Plugin: Open-source and free. You can toggle between Hex view and Normal view by pushing the "H" button.
- PSPad: Freeware. You can toggle to Hex view by choosing the "View" menu and selecting "Hex Edit Mode".
- TextPad: Shareware without an expiration period. To view the Hex value, you need to "open" the file by choosing the file format of "binary" (??).
- UltraEdit: Shareware, not free, 30-day trial only.
Let me know if you have a better choice, which is fast to launch, easy to use, can toggle between Hex and normal view, free, ....
The following Java program displays the hex codes of Java primitives (integer, character and floating-point):
public class PrintHexCode {
   public static void main(String[] args) {
      int i = 12345;
      System.out.println("Decimal is " + i);                        // 12345
      System.out.println("Hex is " + Integer.toHexString(i));       // 3039
      System.out.println("Binary is " + Integer.toBinaryString(i)); // 11000000111001
      System.out.println("Octal is " + Integer.toOctalString(i));   // 30071
      System.out.printf("Hex is %x\n", i);
      System.out.printf("Octal is %o\n", i);

      char c = 'a';
      System.out.println("Character is " + c);         // a
      System.out.printf("Character is %c\n", c);
      System.out.printf("Hex is %x\n", (short) c);     // 61
      System.out.printf("Decimal is %d\n", (short) c); // 97

      float f = 3.5f;
      System.out.println("Decimal is " + f);
      System.out.println(Float.toHexString(f));        // 0x1.cp1 (1.75 × 2^1)

      f = 0.75f;
      System.out.println("Decimal is " + f);
      System.out.println(Float.toHexString(f));        // 0x1.8p-1 (1.5 × 2^-1)

      double d = 11.22;
      System.out.println("Decimal is " + d);
      System.out.println(Double.toHexString(d));       // the closest binary approximation of 11.22
   }
}
In Eclipse, you can view the hex code for integer primitive Java variables in debug mode as follows: In debug perspective, "Variable" panel ⇒ Select the "menu" (inverted triangle) ⇒ Java ⇒ Java Preferences... ⇒ Primitive Display Options ⇒ Check "Display hexadecimal values (byte, short, char, int, long)".
Exercise (Data Representation)
For the following 16-bit codes:

  0000 0000 0010 1010B;
  1000 0000 0010 1010B;

give their values if they are representing:
1. a 16-bit unsigned integer;
2. a 16-bit signed integer;
3. two 8-bit unsigned integers;
4. two 8-bit signed integers;
5. a 16-bit Unicode character;
6. two 8-bit ISO-8859-1 characters.
Ans: (1) 42, 32810; (2) 42, -32726; (3) 0 and 42, 128 and 42; (4) 0 and 42, -128 and 42; (5) '*', '耪'; (6) NUL and '*', PAD and '*'.
Summary: Why Bother about Data Representation?
The integer number 1, the floating-point number 1.0, the character symbol '1', and the string "1" are totally different inside the computer memory. You need to know the differences to write good and high-performance programs.
- In 8-bit signed integer, integer number 1 is represented as 0000 0001B.
- In 8-bit unsigned integer, integer number 1 is represented as 0000 0001B.
- In 16-bit signed integer, integer number 1 is represented as 00000000 00000001B.
- In 32-bit signed integer, integer number 1 is represented as 00000000 00000000 00000000 00000001B.
- In 32-bit floating-point representation, number 1.0 is represented as 0 01111111 0000000 00000000 00000000B, i.e., S=0, E=127, F=0.
- In 64-bit floating-point representation, number 1.0 is represented as 0 01111111111 0000 00000000 00000000 00000000 00000000 00000000 00000000B, i.e., S=0, E=1023, F=0.
- In 8-bit Latin-1, the character symbol '1' is represented as 0011 0001B (or 31H).
- In 16-bit UCS-2, the character symbol '1' is represented as 00000000 00110001B.
- In UTF-8, the character symbol '1' is represented as 00110001B.
If you "add" a 16bit sign integer 1
and Latin1 character '1'
or a string "1",
you could get a surprise.
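A tiny sketch of that "surprise" (the class name is mine):

public class AddSurprise {
   public static void main(String[] args) {
      short i = 1;
      char c = '1';                 // code number 49 (31H), not the value 1
      System.out.println(i + c);    // 50, because '1' is treated as its code number 49
      System.out.println(1 + "1");  // "11", because + here means string concatenation
   }
}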
References & Resources
- (Floating-Point Number Specification) IEEE 754 (1985), "IEEE Standard for Binary Floating-Point Arithmetic".
- (ASCII Specification) ISO/IEC 646 (1991) (or ITU-T T.50-1992), "Information technology - 7-bit coded character set for information interchange".
- (Latin-1 Specification) ISO/IEC 8859-1, "Information technology - 8-bit single-byte coded graphic character sets - Part 1: Latin alphabet No. 1".
- (Unicode Specification) ISO/IEC 10646, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)".
- Unicode Consortium @ http://www.unicode.org.