# Storing and representing ciphertext

Ciphertext is not text!

In cryptography programming you have to be very careful to differentiate between binary data and what we'll refer to here as text data. 'Text' consists of readable, printable characters we expect to see on our computer screen or in a book. It might consist of simple US-ASCII/ANSI characters, or it could be Unicode or double-byte (DBCS) strings such as Chinese, Japanese or Korean text. Text is usually stored in a string type of some kind. 'Binary' data is a string of bits that we conventionally store as bytes or octets.

## Bit strings

The input to, and the ciphertext output from, every modern encryption function is, strictly speaking, always a bit string. A 'bit string' is an ordered sequence of 'bits', each of value either '0' or '1'.

Most programming languages do not have a convenient 'bit string' type and so we have to work around this. We usually store bit strings as a sequence of bytes each consisting of 8 bits (an 8-bit byte is sometimes referred to as an octet). So, for example, a 128-bit bit string can be stored in a 16-byte sequence of bytes (since 128/8=16).
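As a rough illustration of the byte-packing idea, here is a minimal C sketch. The function names `bytes_for_bits` and `get_bit` are our own, and the code assumes the common convention that bit 0 of the bit string is the most significant bit of the first byte.

```c
#include <stddef.h>

/* Number of bytes needed to hold nbits bits, rounding up. */
size_t bytes_for_bits(size_t nbits)
{
    return (nbits + 7) / 8;
}

/* Return bit number i (0 or 1) of a bit string stored in bytes.
   Assumption: bit 0 is the most significant bit of bytes[0]. */
int get_bit(const unsigned char *bytes, size_t i)
{
    return (bytes[i / 8] >> (7 - (i % 8))) & 1;
}
```

With this convention `bytes_for_bits(128)` gives 16, and the first bit of the byte `0xFE` is 1 while its last bit is 0.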

### Storing bit strings

In VB6/VBA we use an array of `Byte` types

```Dim abData() As Byte
nLen = 16
ReDim abData(nLen - 1)
```

In C we use the `unsigned char` type (often typedef'd as `BYTE`)

```unsigned char data[16];
```
or
```unsigned char *pdata;
int len = 16;
pdata = (unsigned char *)malloc(len);
```

C# and VB.NET have the `byte` and `Byte` types respectively

```byte[] data = new byte[16];
```
```Dim data(15) As Byte  ' upper bound 15 gives 16 elements (indexes 0 to 15)
```

## Problems with byte sequences

Sequences of bytes are compact but not very convenient for programmers.

• You can't print them directly - if you do you get garbage.
• They are tricky to manipulate in code.
• They are difficult to debug.
• In C, you have to specify the length with a separate variable.
• Users sometimes treat them as strings and wonder why they have problems.
The only really convenient way to store an array of bytes is as a binary file.
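To see why treating bytes as a string causes trouble, consider this small C sketch. The helper names are hypothetical, and the sample "ciphertext" is made up for illustration. A string function stops at the first zero byte, so the reported length is wrong as soon as the data contains one.

```c
#include <stdio.h>
#include <string.h>

/* Length that a string function would report for the data.
   It stops at the first zero byte, so the caller must be sure one exists. */
size_t length_as_string(const unsigned char *bytes)
{
    return strlen((const char *)bytes);
}

/* Print nbytes bytes as two-digit hex values followed by a newline.
   The true length has to be passed in separately. */
void print_hex(const unsigned char *bytes, size_t nbytes)
{
    size_t i;
    for (i = 0; i < nbytes; i++)
        printf("%02X ", bytes[i]);
    printf("\n");
}
```

For the four bytes `3A 00 F7 9C`, `length_as_string` reports 1, not 4, whereas `print_hex` with an explicit length of 4 shows all of them.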

## Encoding in hexadecimal and base64

A more convenient form is to encode the binary sequence in hexadecimal or base64 format. These encoded forms can easily be stored in a string. Hexadecimal (hex) is particularly convenient because, with practice, you can see immediately what the value of the underlying ciphertext is. Debugging is much easier. Test vectors for encryption algorithms are usually expressed in hexadecimal form. The only real disadvantage of hex-formatted data is that it takes up twice as much storage space as the decoded bytes.

Base64-encoded data is more compact than hexadecimal, but is pretty well impossible to decode by eye.

For example, the 64-bit string
```11111110 11011100 10111010 10011000
01110110 01010100 00110010 00010000
```
can be represented in hex by the eight bytes
`FE DC BA 98 76 54 32 10`
or as the hex-encoded string
`"FEDCBA9876543210"`
In base64 this is
`"/ty6mHZUMhA="`
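The two encodings above can be reproduced with a short C sketch. This is one straightforward way to write the encoders, not the code of any particular library, and the names `hex_encode` and `base64_encode` are our own. The caller is assumed to supply an output buffer that is large enough.

```c
#include <stddef.h>

static const char HEX_DIGITS[] = "0123456789ABCDEF";
static const char B64_CHARS[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode n bytes as upper-case hex. out needs room for 2*n + 1 chars. */
void hex_encode(const unsigned char *in, size_t n, char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = HEX_DIGITS[in[i] >> 4];
        out[2 * i + 1] = HEX_DIGITS[in[i] & 0x0F];
    }
    out[2 * n] = '\0';
}

/* Encode n bytes as base64 with '=' padding.
   out needs room for 4 * ((n + 2) / 3) + 1 chars. */
void base64_encode(const unsigned char *in, size_t n, char *out)
{
    size_t i, j = 0;
    for (i = 0; i < n; i += 3) {
        /* Pack up to three bytes into one 24-bit group */
        unsigned long v = (unsigned long)in[i] << 16;
        if (i + 1 < n) v |= (unsigned long)in[i + 1] << 8;
        if (i + 2 < n) v |= in[i + 2];
        /* Emit four 6-bit values, padding where input bytes were missing */
        out[j++] = B64_CHARS[(v >> 18) & 0x3F];
        out[j++] = B64_CHARS[(v >> 12) & 0x3F];
        out[j++] = (i + 1 < n) ? B64_CHARS[(v >> 6) & 0x3F] : '=';
        out[j++] = (i + 2 < n) ? B64_CHARS[v & 0x3F] : '=';
    }
    out[j] = '\0';
}
```

Given the eight bytes `FE DC BA 98 76 54 32 10`, `hex_encode` produces `FEDCBA9876543210` and `base64_encode` produces `/ty6mHZUMhA=`.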

When programming with bytes, a lot of your programming time is spent converting from hex format into byte format and then back again for debugging and testing. If your encryption package has the option, you may as well work consistently in hex format all the time. You then only need to convert the original plaintext from 'text' into a hex-encoded string before encryption and then convert back after successful decryption.

The advantages of using hex strings include
• You can store them in normal string variables, which are usually easier to manage in programs.
• You can pass them between different computer systems and in emails without corruption.
• Printing is straightforward.
• Debugging is easier as the value of each encoded byte is immediately visible.
Base64 strings have similar advantages except debugging is less convenient. On the downside, hex-encoded strings use twice as much storage as the underlying byte-encoded binary data, more if the strings are stored in Unicode format. Base64 strings are about four-thirds the size of the binary data, that is, roughly one third larger.
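Going the other way, from a hex string back to bytes, might look like the following C sketch. The names `hex_value` and `hex_decode` are hypothetical, and the code assumes an ASCII-compatible character set in which the letters A to F are contiguous.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit.
   Assumes an ASCII-compatible character set. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode a hex string into out, which can hold maxbytes bytes.
   Returns the number of bytes written, or -1 if the string has an
   odd length, contains a non-hex character, or does not fit. */
long hex_decode(const char *hex, unsigned char *out, size_t maxbytes)
{
    size_t n = 0;
    while (hex[0] != '\0') {
        int hi = hex_value(hex[0]);
        int lo;
        if (hi < 0 || hex[1] == '\0') return -1;  /* bad digit or odd length */
        lo = hex_value(hex[1]);
        if (lo < 0 || n >= maxbytes) return -1;   /* bad digit or no room */
        out[n++] = (unsigned char)((hi << 4) | lo);
        hex += 2;
    }
    return (long)n;
}
```

Decoding `"FEDCBA9876543210"` yields 8 bytes starting with `0xFE`. Rejecting malformed input, rather than silently skipping it, makes mistakes in test data show up early.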

## Converting strings to bytes and vice versa

Use these functions to convert a string of text to an unambiguous array of bytes and vice versa.

### VB6/VBA

In VB6/VBA, use the `StrConv` function.

```Dim abData() As Byte
Dim Str As String
Dim i As Long
Str = "Hello world!"
' Convert string to bytes
abData = StrConv(Str, vbFromUnicode)
For i = 0 To UBound(abData)
Debug.Print Hex(abData(i)); "='" & Chr(abData(i)) & "'"
Next
' Convert bytes to string
Str = StrConv(abData, vbUnicode)
Debug.Print "'" & Str & "'"
```
```48='H'
65='e'
6C='l'
6C='l'
6F='o'
20=' '
77='w'
6F='o'
72='r'
6C='l'
64='d'
21='!'
'Hello world!'
```

### VB.NET

In VB.NET use `System.Text.Encoding`.

```Dim abData() As Byte
Dim Str As String
Dim i As Integer
Str = "Hello world!"
' Convert string to bytes
abData = System.Text.Encoding.Default.GetBytes(Str)
For i = 0 To UBound(abData)
Console.WriteLine(Hex(abData(i)) & "='" & Chr(abData(i)) & "'")
Next
' Convert bytes to string
Str = System.Text.Encoding.Default.GetString(abData)
Console.WriteLine("'" & Str & "'")
```

In .NET strings are stored internally in "Unicode" format (UTF-16) and the GetBytes method can extract an array of bytes in any encoding you want.

The `.Default` encoding uses the default code page on your system, which on the .NET Framework is usually 1252 (Western European) but may be different on your setup. On .NET Core and .NET 5 or later, `.Default` is always UTF-8. If you want ISO-8859-1 (Latin-1) you can replace `.Default` with `.GetEncoding(28591)` (code page 28591 is ISO-8859-1, which is identical to Windows-1252 except for characters in the range 0x80 to 0x9F). Alternatively use `System.Text.Encoding.GetEncoding("iso-8859-1").GetBytes(Str)`. If you want UTF-8-encoded bytes, use `System.Text.Encoding.UTF8.GetBytes(Str)`.
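The choice of encoding matters because the same character can produce different bytes, and so different ciphertext. The following C sketch is language-neutral in spirit and is not how .NET implements its encoders. The function names are our own. It encodes a single Unicode code point in Latin-1 and in UTF-8.

```c
#include <stddef.h>

/* Encode one code point as ISO-8859-1 (Latin-1).
   Returns 1 on success, 0 if the character cannot be represented. */
size_t latin1_encode(unsigned long cp, unsigned char *out)
{
    if (cp > 0xFF) return 0;
    out[0] = (unsigned char)cp;
    return 1;
}

/* Encode one code point as UTF-8. out needs room for 4 bytes.
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The letter é (U+00E9) becomes the single byte `E9` in Latin-1 but the two bytes `C3 A9` in UTF-8. The euro sign (U+20AC) becomes `E2 82 AC` in UTF-8 and cannot be represented in Latin-1 at all.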

### C#

In C#, use the `System.Text.Encoding` class, which behaves identically to its use in VB.NET.

```byte[] abData;
string Str;
int i;
Str = "Hello world!";
// Convert string to bytes
abData = System.Text.Encoding.Default.GetBytes(Str);
for (i = 0; i < abData.Length; i++)
{
Console.WriteLine("{0:X}", abData[i]);
}
// Convert bytes to string
Str = System.Text.Encoding.Default.GetString(abData);
Console.WriteLine("'{0}'", Str);
```

### C/C++

In C and C++, the distinction between a string and an array of bytes is often blurred. A string is a zero-terminated sequence of `char` types and bytes are stored in the `unsigned char` type. A string needs an extra character for the terminating null character; a byte array does not, but it needs its length to be stored in a separate variable. A byte array can contain a zero (NUL) value but a string cannot.
```#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static void pr_hexbytes(const unsigned char *bytes, size_t nbytes)
/* Print bytes in hex format + newline */
{
    size_t i;
    for (i = 0; i < nbytes; i++)
        printf("%02X ", bytes[i]);
    printf("\n");
}

int main(void)
{
    char szStr[] = "Hello world!";
    unsigned char *lpData;
    size_t nbytes;
    char *lpszCopy;

    /* Convert string to bytes */
    /* (a) simply re-cast */
    lpData = (unsigned char*)szStr;
    nbytes = strlen(szStr);   /* excludes the terminating zero */
    pr_hexbytes(lpData, nbytes);

    /* (b) make a copy (the cast on malloc is only needed in C++) */
    lpData = (unsigned char*)malloc(nbytes);
    if (lpData == NULL)
        return 1;
    memcpy(lpData, szStr, nbytes);
    pr_hexbytes(lpData, nbytes);

    /* Convert bytes to a zero-terminated string */
    lpszCopy = (char*)malloc(nbytes + 1);
    if (lpszCopy == NULL)
    {
        free(lpData);
        return 1;
    }
    memcpy(lpszCopy, lpData, nbytes);
    lpszCopy[nbytes] = '\0';
    printf("'%s'\n", lpszCopy);

    free(lpData);
    free(lpszCopy);

    return 0;
}
```
```48 65 6C 6C 6F 20 77 6F 72 6C 64 21
48 65 6C 6C 6F 20 77 6F 72 6C 64 21
'Hello world!'
```

The types `char` and `unsigned char` might be identical on your system, or they might not be. We strongly recommend that you explicitly distinguish between strings and byte arrays in your code by using the correct type and consistently treating them differently.

### Wide and multibyte characters in C

The wide character type is used to represent the Unicode character set on Windows and Linux systems. The `wchar_t` type should be 2 bytes long on Windows and 4 bytes on Linux - but don't make any assumptions (check `sizeof(wchar_t)` to find out). Windows uses UTF-16 encoding and Linux uses UTF-32.

You could just cast directly to a sequence of bytes using `(unsigned char*)wcstring` and encrypt that. This will work if the other party has exactly the same setup as you; i.e. the same size `wchar_t` and the same endianness. If not, write a little routine that takes the byte array and converts the word sizes and/or swaps the endianness.
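One way to avoid depending on the local setup is to write each wide character out in a fixed size and byte order before encrypting. The C sketch below is only an illustration: the name `wcs_to_be32` is our own, and it assumes each `wchar_t` holds one complete code point, which on Windows is only true for characters without surrogate pairs.

```c
#include <stddef.h>
#include <wchar.h>

/* Write each wide character as 4 bytes, most significant byte first,
   whatever sizeof(wchar_t) and the machine's endianness happen to be.
   out needs room for 4 * wcslen(ws) bytes. Returns the bytes written.
   Assumption: each wchar_t is one whole code point (no surrogate pairs). */
size_t wcs_to_be32(const wchar_t *ws, unsigned char *out)
{
    size_t i, n = wcslen(ws);
    for (i = 0; i < n; i++) {
        unsigned long v = (unsigned long)ws[i] & 0xFFFFFFFFUL;
        out[4 * i]     = (unsigned char)((v >> 24) & 0xFF);
        out[4 * i + 1] = (unsigned char)((v >> 16) & 0xFF);
        out[4 * i + 2] = (unsigned char)((v >> 8) & 0xFF);
        out[4 * i + 3] = (unsigned char)(v & 0xFF);
    }
    return 4 * n;
}
```

For `L"abc"` this gives the 12 bytes `00 00 00 61 00 00 00 62 00 00 00 63` on both Windows and Linux, because the shifts operate on values rather than on the in-memory layout.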

A multibyte character type can be one or more bytes long. ASCII characters are represented by a single byte. Characters from the extended character set are represented by a sequence of two or more bytes. In the traditional Windows double-byte code pages, MBCS characters are never more than two bytes long, and a special lead byte signals that the byte following it belongs to the same character. Linux uses whatever is set by the current locale. Casting a multi-byte string to bytes and encrypting that will only work if the other party has the same locale. Our advice is to avoid them for anything except simple ASCII locales.

You can use the `wcstombs` and `mbstowcs` functions in `stdlib.h` to convert between wide character and multi-byte strings. The equivalent functions in Windows are `WideCharToMultiByte` and `MultiByteToWideChar`.

Here is a C program that demonstrates converting wide character strings, with the results of running it on a Windows machine and on an x86 Linux machine.

```/* wchar_tests.c */

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    size_t nchars, nbytes, i;
    const unsigned char *ptr;
    char *mbstring;
    const wchar_t *wcstring;

    /* Make and print a wide-character string */
    wcstring = L"abc";
    wprintf(L"%ls\n", wcstring);

    /* How long is it? (size_t values are cast to match %lu) */
    nchars = wcslen(wcstring);
    wprintf(L"%lu characters\n", (unsigned long)nchars);

    /* How big is a wchar_t? */
    wprintf(L"sizeof(wchar_t)=%lu\n", (unsigned long)sizeof(wchar_t));

    /* Cast to array of bytes and print */
    ptr = (const unsigned char*)wcstring;
    nbytes = nchars * sizeof(wchar_t);
    wprintf(L"%lu bytes\n", (unsigned long)nbytes);
    for (i = 0; i < nbytes; i++)
    {
        wprintf(L"%02X ", ptr[i]);
    }
    wprintf(L"\n");

    /* Convert to multi-byte string */
    nbytes = wcstombs(NULL, wcstring, 0);
    if (nbytes == (size_t)-1)
        return 1;   /* a character cannot be converted in this locale */
    wprintf(L"%lu bytes in multi-byte string\n", (unsigned long)nbytes);
    mbstring = malloc(nbytes+1);
    if (mbstring == NULL)
        return 1;
    nbytes = wcstombs(mbstring, wcstring, nbytes+1);

    /* Cast to array of bytes and print */
    ptr = (const unsigned char*)mbstring;
    for (i = 0; i < nbytes; i++)
    {
        wprintf(L"%02X ", ptr[i]);
    }
    wprintf(L"\n");

    free(mbstring);
    return 0;
}
```

The output on Windows

```abc
3 characters
sizeof(wchar_t)=2
6 bytes
61 00 62 00 63 00
3 bytes in multi-byte string
61 62 63
```

The output on Linux

```abc
3 characters
sizeof(wchar_t)=4
12 bytes
61 00 00 00 62 00 00 00 63 00 00 00
3 bytes in multi-byte string
61 62 63
```

## Input to Encryption and Decryption Processes

The input to an encryption process must be 'binary' data, i.e. a 'bit string'. We need to convert the text we want to encrypt into 'binary' format first and then encrypt it. The results of encryption are always binary. Do not attempt to treat raw ciphertext as 'text' or put it directly into a `String` type. Store ciphertext either as a raw binary file or convert it to base64 or hexadecimal format. You can safely put data in base64 or hexadecimal format in a `String`.
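If you store raw ciphertext in a file from C, open the file in binary mode. The sketch below uses hypothetical helper names, `write_bytes` and `read_bytes`. The `"b"` in the mode string matters on Windows, where text mode translates newline bytes on the way in and out.

```c
#include <stdio.h>

/* Write nbytes raw bytes to a file. Returns 0 on success, -1 on failure. */
int write_bytes(const char *path, const unsigned char *bytes, size_t nbytes)
{
    FILE *fp = fopen(path, "wb");   /* "b" = binary mode, no translation */
    size_t nwritten;
    if (fp == NULL) return -1;
    nwritten = fwrite(bytes, 1, nbytes, fp);
    if (fclose(fp) != 0 || nwritten != nbytes) return -1;
    return 0;
}

/* Read up to maxbytes raw bytes from a file.
   Returns the number of bytes read, or -1 if the file cannot be opened. */
long read_bytes(const char *path, unsigned char *bytes, size_t maxbytes)
{
    FILE *fp = fopen(path, "rb");
    size_t nread;
    if (fp == NULL) return -1;
    nread = fread(bytes, 1, maxbytes, fp);
    fclose(fp);
    return (long)nread;
}
```

Both functions carry the length explicitly, so zero bytes and newline bytes in the ciphertext survive the round trip unchanged.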

When you decrypt, always start with binary data, decrypt to binary data, and then and only then, convert back to text, if that is what you are expecting. You can devise your own checks to make sure the decrypted ciphertext is what you expect before you do the final conversion.
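As an example of such a check, the C sketch below refuses to convert decrypted bytes to a string unless every byte is plain printable ASCII. The function names are our own, and the test is deliberately strict. If you expect UTF-8 or another encoding you would need a different check.

```c
#include <stdlib.h>
#include <string.h>

/* Return 1 if every byte is printable ASCII or tab, CR or LF; else 0. */
int is_plain_ascii(const unsigned char *bytes, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned char b = bytes[i];
        if (!((b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\r' || b == '\n'))
            return 0;
    }
    return 1;
}

/* Convert decrypted bytes to a newly allocated zero-terminated string,
   but only if they pass the check. Returns NULL otherwise.
   The caller must free the result. */
char *bytes_to_text(const unsigned char *bytes, size_t n)
{
    char *s;
    if (!is_plain_ascii(bytes, n)) return NULL;
    s = (char *)malloc(n + 1);
    if (s == NULL) return NULL;
    memcpy(s, bytes, n);
    s[n] = '\0';
    return s;
}
```

A failed check usually means the wrong key, the wrong mode, or damaged ciphertext, so it is better to report an error than to hand garbage to code that expects text.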

On a US-English system set up for ANSI characters, you can probably get away with using a `String` type to carry out 'binary' operations. We know, we have done it for years and a lot of code on this site still has residual mistakes in it. We spend a lot of time explaining to people why their code doesn't work properly on their Chinese/Japanese/Korean/Hebrew system.