Storing and representing ciphertext

Ciphertext is not text!

In cryptographic programming you have to be very careful to differentiate between binary data and what we'll refer to here as text data. 'Text' consists of readable, printable characters we expect to see on our computer screen or in a book. It might consist of simple US-ASCII/ANSI characters, or it could be Unicode or double-byte (DBCS) East Asian character strings. Text is usually stored in a string type of some kind. 'Binary' data is a string of bits that we conventionally store as bytes or octets.

Bit strings

The input to and ciphertext output from all modern encryption functions is, strictly speaking, always a bit string. A 'bit string' is an ordered sequence of 'bits', each of value either '0' or '1'.

Most programming languages do not have a convenient 'bit string' type and so we have to work around this. We usually store bit strings as a sequence of bytes each consisting of 8 bits (an 8-bit byte is sometimes referred to as an octet). So, for example, a 128-bit bit string can be stored in a 16-byte sequence of bytes (since 128/8=16).

Storing bit strings

In VB we use an array of Byte types

Dim abData() As Byte
Dim nLen As Long
nLen = 16
ReDim abData(nLen - 1)

In C we use the unsigned char type (often typedef'd as BYTE)

unsigned char data[16];
or
unsigned char *pdata;
int len = 16;
pdata = (unsigned char *)malloc(len);

C# and VB.NET have the byte and Byte types respectively

byte[] data = new byte[16];   // C#: 16 elements
Dim data(15) As Byte          ' VB.NET: 15 is the upper bound, giving 16 elements

Problems with byte sequences

Sequences of bytes are compact but not very convenient for programmers.

The only really convenient way to store an array of bytes is as a binary file.

Encoding in hexadecimal and base64

A more convenient form is to encode the binary sequence in hexadecimal or base64 format. These encoded forms can easily be stored in a string. Hexadecimal (hex) is particularly convenient because you can easily (well, with practice) see immediately what the value of the underlying ciphertext is. Debugging is much easier. Test vectors for encryption algorithms are usually expressed in hexadecimal form. The only real disadvantage of hex-formatted data is that it takes up twice as much storage space as the decoded bytes.

Base64-encoded data is more compact than hexadecimal, but is pretty well impossible to decode by eye.

For example, the 64-bit string
11111110 11011100 10111010 10011000 
01110110 01010100 00110010 00010000
can be represented in hex by the eight bytes
FE DC BA 98 76 54 32 10
or as the hex-encoded string
"FEDCBA9876543210"
In base64 this is
"/ty6mHZUMhA="

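The hex and base64 forms in the example above can be produced with a few lines of C. The following is a minimal sketch rather than a library-quality implementation; the function names hex_encode and base64_encode are our own for illustration, and the caller is assumed to supply output buffers of the sizes stated in the comments.

```c
#include <stddef.h>

/* Hex-encode n bytes into out, which must hold 2*n + 1 chars. */
static void hex_encode(const unsigned char *in, size_t n, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = digits[in[i] >> 4];    /* high nibble */
        out[2 * i + 1] = digits[in[i] & 0x0F];  /* low nibble */
    }
    out[2 * n] = '\0';
}

/* Base64-encode n bytes into out, which must hold 4*((n+2)/3) + 1 chars. */
static void base64_encode(const unsigned char *in, size_t n, char *out)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t i, j = 0;
    for (i = 0; i < n; i += 3) {
        /* Pack up to three bytes into a 24-bit group, zero-filling the rest */
        unsigned long v = (unsigned long)in[i] << 16;
        if (i + 1 < n) v |= (unsigned long)in[i + 1] << 8;
        if (i + 2 < n) v |= in[i + 2];
        /* Emit four 6-bit values, padding with '=' where input ran out */
        out[j++] = tbl[(v >> 18) & 0x3F];
        out[j++] = tbl[(v >> 12) & 0x3F];
        out[j++] = (i + 1 < n) ? tbl[(v >> 6) & 0x3F] : '=';
        out[j++] = (i + 2 < n) ? tbl[v & 0x3F] : '=';
    }
    out[j] = '\0';
}
```

Called on the eight bytes FE DC BA 98 76 54 32 10, hex_encode needs a 17-char buffer and base64_encode a 13-char buffer, and they yield the two strings shown in the example.
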
Advantages of hex-encoded strings

When programming with bytes, a lot of your programming time is spent converting from hex format into byte format and then back again for debugging and testing. If your encryption package has the option, you may as well work consistently in hex format all the time. You then only need to convert the original plaintext from 'text' into a hex-encoded string before encryption and then convert back after successful decryption.
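
If your package only accepts bytes, the conversion back from hex is short. This is a sketch under our own assumptions: the names hex_value and hex_decode are illustrative, both upper and lower case digits are accepted, and anything else (including an odd number of digits) is rejected.

```c
#include <stddef.h>
#include <string.h>

/* Return the value 0..15 of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode a hex string into out (capacity outlen bytes).
   Returns the number of bytes written, or -1 on malformed input
   (odd length, non-hex character) or if out is too small. */
static long hex_decode(const char *hex, unsigned char *out, size_t outlen)
{
    size_t len = strlen(hex), i;
    if (len % 2 != 0 || len / 2 > outlen) return -1;
    for (i = 0; i < len / 2; i++) {
        int hi = hex_value(hex[2 * i]);
        int lo = hex_value(hex[2 * i + 1]);
        if (hi < 0 || lo < 0) return -1;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    return (long)(len / 2);
}
```

Returning the byte count separately matters because the decoded data may contain zero bytes, so it cannot be measured afterwards with strlen.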

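The advantages of using hex strings are those noted above: they can be stored safely in an ordinary string type, and the underlying values are easy to read when debugging and when comparing against test vectors. Base64 strings have similar advantages except debugging is less convenient. On the downside, hex-encoded strings use twice as much storage as the underlying byte-encoded binary data, more if the strings are stored in Unicode format. Base64 strings expand the binary data to about four-thirds of its original size.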
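
The storage cost is easy to work out in advance. The two helper functions below are a sketch with names of our own choosing; the base64 figure assumes the usual '=' padding to a multiple of four characters, and neither count includes a terminating zero.

```c
#include <stddef.h>

/* Characters needed to hex-encode n bytes. */
static size_t hex_encoded_len(size_t n)
{
    return 2 * n;
}

/* Characters needed to base64-encode n bytes with '=' padding:
   every group of up to 3 input bytes becomes 4 output characters. */
static size_t base64_encoded_len(size_t n)
{
    return 4 * ((n + 2) / 3);
}
```

So a 16-byte (128-bit) block takes 32 characters in hex and 24 in base64, and the 8-byte example above takes 16 and 12.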

Converting strings to bytes and vice versa

Use these functions to convert a string of text to an unambiguous array of bytes and vice versa.

VB6/VBA

In VB6/VBA, use the StrConv function.

Dim abData() As Byte
Dim Str As String
Dim i As Long
Str = "Hello world!"
' Convert string to bytes
abData = StrConv(Str, vbFromUnicode)
For i = 0 To UBound(abData)
    Debug.Print Hex(abData(i)); "='" & Chr(abData(i)) & "'"
Next
' Convert bytes to string
Str = StrConv(abData, vbUnicode)
Debug.Print "'" & Str & "'"

Output:

48='H'
65='e'
6C='l'
6C='l'
6F='o'
20=' '
77='w'
6F='o'
72='r'
6C='l'
64='d'
21='!'
'Hello world!'

VB.NET

In VB.NET use System.Text.Encoding.

Dim abData() As Byte
Dim Str As String
Dim i As Integer
Str = "Hello world!"
' Convert string to bytes
abData = System.Text.Encoding.Default.GetBytes(Str)
For i = 0 To UBound(abData)
    Console.WriteLine(Hex(abData(i)) & "='" & Chr(abData(i)) & "'")
Next
' Convert bytes to string
Str = System.Text.Encoding.Default.GetString(abData)
Console.WriteLine("'" & Str & "'")

To be more explicit, replace .Default with .GetEncoding(1252), substituting the appropriate code page for your character set (1252 is Western European).

C#

In C#, use the System.Text.Encoding class, which behaves exactly as it does in VB.NET.

byte[] abData;
string Str;
int i;
Str = "Hello world!";
// Convert string to bytes
abData = System.Text.Encoding.Default.GetBytes(Str);
for (i = 0; i < abData.Length; i++)
{
	Console.WriteLine("{0:X}", abData[i]);
}
// Convert bytes to string
Str = System.Text.Encoding.Default.GetString(abData);
Console.WriteLine("'{0}'", Str);

C/C++

In C and C++, the distinction between a string and an array of bytes is often blurred. A string is a zero-terminated sequence of char types, and bytes are stored in the unsigned char type. A string needs an extra character for the terminating null character; a byte array does not, but it needs its length to be stored in a separate variable. A byte array can contain a zero (NUL) value but a string cannot.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static void pr_hexbytes(const unsigned char *bytes, size_t nbytes)
/* Print bytes in hex format + newline */
{
   size_t i;
   for (i = 0; i < nbytes; i++)
      printf("%02X ", bytes[i]);
   printf("\n");
}

int main(void)
{
   char szStr[] = "Hello world!";
   unsigned char *lpData;
   size_t nbytes;
   char *lpszCopy;

   /* Convert string to bytes */
   /* (a) simply re-cast */
   lpData = (unsigned char*)szStr;
   nbytes = strlen(szStr);
   pr_hexbytes(lpData, nbytes);

   /* (b) make a copy (the terminating zero is not copied) */
   lpData = malloc(nbytes);
   if (lpData == NULL) return 1;
   memcpy(lpData, szStr, nbytes);
   pr_hexbytes(lpData, nbytes);

   /* Convert bytes to a zero-terminated string */
   lpszCopy = malloc(nbytes + 1);
   if (lpszCopy == NULL) { free(lpData); return 1; }
   memcpy(lpszCopy, lpData, nbytes);
   lpszCopy[nbytes] = '\0';
   printf("'%s'\n", lpszCopy);

   free(lpData);
   free(lpszCopy);

   return 0;
}

Output:

48 65 6C 6C 6F 20 77 6F 72 6C 64 21
48 65 6C 6C 6F 20 77 6F 72 6C 64 21
'Hello world!'

The types char and unsigned char might be identical on your system, or they might not be. We strongly recommend that you explicitly distinguish between strings and byte arrays in your code by using the correct type and consistently treating them differently.
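
One practical consequence of a byte array being allowed to contain zero values is that string functions will silently truncate it. The helper below is a small sketch (the name length_seen_as_string is ours) that reports how many bytes of a buffer a function like strlen or strcpy would actually see.

```c
#include <stddef.h>
#include <string.h>

/* Length that string functions would see for a buffer of nbytes bytes:
   the offset of the first zero byte, or nbytes if there is none. */
static size_t length_seen_as_string(const unsigned char *buf, size_t nbytes)
{
    const unsigned char *p = memchr(buf, 0, nbytes);
    return p ? (size_t)(p - buf) : nbytes;
}
```

For a made-up 4-byte ciphertext 3A 00 7F 10 this returns 1, so treating the buffer as a string would lose three of the four bytes. This is why the length of a byte array must always travel with it in a separate variable.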

Unicode strings in C

If your string is a Unicode string, then it consists of a sequence of wchar_t types, which are 2 bytes long on Windows but typically 4 bytes on Linux and other Unix-like systems. Converting wide-character strings to a sequence of bytes in C is more problematic. You can either copy the Unicode string directly to a string of bytes (in which case, with 2-byte wide characters, every second byte will be zero for US-ASCII characters), or use the stdlib wcstombs function or the Windows WideCharToMultiByte function to convert to a sequence of multi-byte characters (some will be one byte long, some two or more) and then convert the multi-byte string to bytes (you can do this with a simple cast). Each party encrypting and decrypting must agree on which way to do it.
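
The wcstombs route might look like the following sketch. The function name wide_to_bytes is our own, and the result depends on the current C locale (set with setlocale), so both parties need to agree on that as well. In the default "C" locale only plain US-ASCII characters are certain to convert.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a wide string to multi-byte form in the current locale.
   Returns a malloc'd byte buffer (caller frees) and stores the number
   of bytes, excluding the terminating zero, in *nbytes.
   Returns NULL if allocation fails or a character cannot be converted. */
static unsigned char *wide_to_bytes(const wchar_t *ws, size_t *nbytes)
{
    /* Worst case: every wide character needs MB_CUR_MAX bytes */
    size_t cap = wcslen(ws) * MB_CUR_MAX + 1;
    char *buf = malloc(cap);
    size_t n;

    if (buf == NULL) return NULL;
    n = wcstombs(buf, ws, cap);
    if (n == (size_t)-1) {      /* unconvertible character */
        free(buf);
        return NULL;
    }
    *nbytes = n;
    return (unsigned char *)buf;   /* the "simple cast" to bytes */
}
```

The returned length, not strlen, is what you would pass on to the encryption function.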

Input to Encryption and Decryption Processes

The input to an encryption process must be 'binary' data, i.e. a 'bit string'. We need to convert the text we want to encrypt into 'binary' format first and then encrypt it. The results of encryption are always binary. Do not attempt to treat raw ciphertext as 'text' or put it directly into a String type. Store ciphertext either as a raw binary file or convert it to base64 or hexadecimal format. You can safely put data in base64 or hexadecimal format in a String.

When you decrypt, always start with binary data, decrypt to binary data, and then and only then, convert back to text, if that is what you are expecting. You can devise your own checks to make sure the decrypted ciphertext is what you expect before you do the final conversion.
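
As one example of such a check, if you know the plaintext was plain US-ASCII text, you can test the decrypted bytes before converting them. This is only a sketch of one possible check, under that assumption; the name looks_like_ascii_text is ours, and passing the check does not prove the decryption was correct.

```c
#include <stddef.h>

/* Return 1 if every byte is printable US-ASCII (0x20..0x7E)
   or a tab, line feed or carriage return; otherwise return 0. */
static int looks_like_ascii_text(const unsigned char *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned char b = buf[i];
        if (b == '\t' || b == '\n' || b == '\r') continue;
        if (b < 0x20 || b > 0x7E) return 0;
    }
    return 1;
}
```

Decrypting with the wrong key generally produces bytes spread over the whole 0 to 255 range, so a buffer of any length that passes this test is unlikely to be garbage. If it fails, keep the data as bytes and report an error rather than forcing it into a string.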

On a US-English system set up for ANSI characters, you can probably get away with using a String type to carry out 'binary' operations. We know, we have done it for years and a lot of code on this site still has residual mistakes in it. We spend a lot of time explaining to people why their code doesn't work properly on their Chinese/Japanese/Korean/Hebrew system.

Other information

This document last updated 25 February 2007.
