Storing and representing ciphertext
Ciphertext is not text!
In cryptographic programming you have to be very careful to differentiate between binary data and what we'll refer to here as text data. 'Text' consists of readable, printable characters we expect to see on our computer screen or in a book. It might consist of simple US-ASCII/ANSI characters or it could be Unicode or DBCS East Asian character strings. Text is usually stored in a string type of some kind. 'Binary' data is a string of bits that we conventionally store as bytes or octets.
- Bit strings
- Problems with byte sequences
- Encoding in hexadecimal and base64
- Advantages of hex-encoded strings
- Converting strings to bytes and vice versa
- Input to Encryption and Decryption Processes
- Other information
Bit strings
The input to and ciphertext output from all modern encryption functions is, strictly speaking, always a bit string. A 'bit string' is an ordered sequence of 'bits', each of value either '0' or '1'.
Most programming languages do not have a convenient 'bit string' type and so we have to work around this. We usually store bit strings as a sequence of bytes each consisting of 8 bits (an 8-bit byte is sometimes referred to as an octet). So, for example, a 128-bit bit string can be stored in a 16-byte sequence of bytes (since 128/8=16).
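The arithmetic above can be sketched in C. This is a minimal illustration, not part of any particular library: the function names are hypothetical, and the code assumes the common convention that bit 0 of the bit string is the most significant bit of the first byte.

```c
#include <stddef.h>

/* Number of bytes needed to hold nbits bits (rounds up). */
size_t bytes_for_bits(size_t nbits)
{
    return (nbits + 7) / 8;
}

/* Return bit i (0 or 1) of a bit string stored in data[].
   Assumption: bit 0 is the most significant bit of data[0]. */
int get_bit(const unsigned char *data, size_t i)
{
    return (data[i / 8] >> (7 - (i % 8))) & 1;
}

/* Set bit i of the bit string to value (0 or 1), same bit order. */
void set_bit(unsigned char *data, size_t i, int value)
{
    unsigned char mask = (unsigned char)(1u << (7 - (i % 8)));
    if (value)
        data[i / 8] |= mask;
    else
        data[i / 8] &= (unsigned char)~mask;
}
```

With these helpers, bytes_for_bits(128) gives 16, matching the example above.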
Storing bit strings
In VB6/VBA we use an array of Byte types:
    Dim abData() As Byte
    Dim nLen As Long
    nLen = 16
    ReDim abData(nLen - 1)
In C we use the unsigned char type (often typedef'd as BYTE):
    unsigned char data[16];
or
    unsigned char *pdata;
    int len = 16;
    pdata = (unsigned char *)malloc(len);
C# and VB.NET have the byte and Byte types respectively:
    byte[] data = new byte[16];
    Dim data(15) As Byte    ' VB.NET: upper bound 15 gives 16 elements
Problems with byte sequences
Sequences of bytes are compact but not very convenient for programmers.
- You can't print them directly - if you do you get garbage.
- They are tricky to manipulate in code.
- They are difficult to debug.
- In C, you have to specify the length with a separate variable.
- Users sometimes treat them as strings and wonder why they have problems.
Encoding in hexadecimal and base64
A more convenient form is to encode the binary sequence in hexadecimal or base64 format. These encoded forms can easily be stored in a string. Hexadecimal (hex) is particularly convenient because you can easily (well, with practice) see immediately what the value of the underlying ciphertext is. Debugging is much easier. Test vectors for encryption algorithms are usually expressed in hexadecimal form. The only real disadvantage of hex-formatted data is that it takes up twice as much storage space as the decoded bytes.
Base64-encoded data is more compact than hexadecimal, but is pretty well impossible to decode by eye.
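A base64 encoder is short enough to sketch in C. The following is an illustrative version using the standard alphabet with '=' padding; the function name base64_encode and its calling convention are our own assumptions, not the API of any particular package.

```c
#include <stddef.h>

/* Encode nbytes of binary data as base64 (standard alphabet, '=' padding).
   out must have room for 4 * ((nbytes + 2) / 3) + 1 characters.
   Returns the number of characters written, excluding the terminating zero. */
size_t base64_encode(const unsigned char *data, size_t nbytes, char *out)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t i, n = 0;

    /* Each full group of 3 bytes becomes 4 characters */
    for (i = 0; i + 2 < nbytes; i += 3) {
        unsigned long v = ((unsigned long)data[i] << 16) |
                          ((unsigned long)data[i + 1] << 8) | data[i + 2];
        out[n++] = tbl[(v >> 18) & 0x3F];
        out[n++] = tbl[(v >> 12) & 0x3F];
        out[n++] = tbl[(v >> 6) & 0x3F];
        out[n++] = tbl[v & 0x3F];
    }
    /* One or two bytes left over: pad the output with '=' */
    if (nbytes - i == 1) {
        unsigned long v = (unsigned long)data[i] << 16;
        out[n++] = tbl[(v >> 18) & 0x3F];
        out[n++] = tbl[(v >> 12) & 0x3F];
        out[n++] = '=';
        out[n++] = '=';
    } else if (nbytes - i == 2) {
        unsigned long v = ((unsigned long)data[i] << 16) |
                          ((unsigned long)data[i + 1] << 8);
        out[n++] = tbl[(v >> 18) & 0x3F];
        out[n++] = tbl[(v >> 12) & 0x3F];
        out[n++] = tbl[(v >> 6) & 0x3F];
        out[n++] = '=';
    }
    out[n] = '\0';
    return n;
}
```

Every 3 input bytes become 4 output characters, so 8 bytes of data need 12 characters in base64 compared with 16 in hex.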
For example, the 64-bit string
    11111110 11011100 10111010 10011000 01110110 01010100 00110010 00010000
can be represented in hex by the eight bytes
    FE DC BA 98 76 54 32 10
or as the hex-encoded string
    "FEDCBA9876543210"
In base64 this is
    "/ty6mHZUMhA="
Advantages of hex-encoded strings
When programming with bytes, a lot of your programming time is spent converting from hex format into byte format and then back again for debugging and testing. If your encryption package has the option, you may as well work consistently in hex format all the time. You then only need to convert the original plaintext from 'text' into a hex-encoded string before encryption and then convert back after successful decryption.
The advantages of using hex strings include:
- You can store them in normal string variables, which are usually easier to manage in programs.
- You can pass them between different computer systems and in emails without corruption.
- Printing is straightforward.
- Debugging is easier as the value of each encoded byte is immediately visible.
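If your package does not provide hex conversion, the two routines are easy to write. Here is a minimal sketch in C; the names hex_encode and hex_decode are hypothetical, and the decoder deliberately rejects odd-length input and non-hex characters instead of guessing.

```c
#include <stddef.h>
#include <string.h>

/* Encode nbytes of data as uppercase hex. out needs 2 * nbytes + 1 chars. */
void hex_encode(const unsigned char *data, size_t nbytes, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t i;
    for (i = 0; i < nbytes; i++) {
        out[2 * i] = digits[data[i] >> 4];
        out[2 * i + 1] = digits[data[i] & 0x0F];
    }
    out[2 * nbytes] = '\0';
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode a hex string into out (capacity maxbytes).
   Returns the number of bytes written, or -1 if the string has odd length,
   contains a non-hex character, or does not fit in out. */
long hex_decode(const char *hex, unsigned char *out, size_t maxbytes)
{
    size_t len = strlen(hex), i;
    if (len % 2 != 0 || len / 2 > maxbytes)
        return -1;
    for (i = 0; i < len / 2; i++) {
        int hi = hex_value(hex[2 * i]);
        int lo = hex_value(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    return (long)(len / 2);
}
```

Encoding the eight bytes FE DC BA 98 76 54 32 10 gives "FEDCBA9876543210", and decoding that string (in either case) gives the same eight bytes back.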
Converting strings to bytes and vice versa
Use these functions to convert a string of text to an unambiguous array of bytes and vice versa.
VB6/VBA
In VB6/VBA, use the StrConv function.
    Dim abData() As Byte
    Dim Str As String
    Dim i As Long
    Str = "Hello world!"
    ' Convert string to bytes
    abData = StrConv(Str, vbFromUnicode)
    For i = 0 To UBound(abData)
        Debug.Print Hex(abData(i)); "='" & Chr(abData(i)) & "'"
    Next
    ' Convert bytes to string
    Str = StrConv(abData, vbUnicode)
    Debug.Print "'" & Str & "'"
Output:
    48='H'
    65='e'
    6C='l'
    6C='l'
    6F='o'
    20=' '
    77='w'
    6F='o'
    72='r'
    6C='l'
    64='d'
    21='!'
    'Hello world!'
It gets more complicated when dealing with UTF-8-encoded strings. See How to convert VBA/VB6 Unicode strings to UTF-8.
VB.NET
In VB.NET use System.Text.Encoding.
    Dim abData() As Byte
    Dim Str As String
    Dim i As Long
    Str = "Hello world!"
    ' Convert string to bytes
    abData = System.Text.Encoding.Default.GetBytes(Str)
    For i = 0 To UBound(abData)
        Console.WriteLine(Hex(abData(i)) & "='" & Chr(abData(i)) & "'")
    Next
    ' Convert bytes to string
    Str = System.Text.Encoding.Default.GetString(abData)
    Console.WriteLine("'" & Str & "'")
In .NET, strings are stored internally in "Unicode" format (UTF-16) and the GetBytes method can extract an array of bytes in any encoding you want.
On .NET Framework, the .Default encoding uses the default code page on your system, which is usually 1252 (Western European) but may be different on your setup. On .NET Core and .NET 5 or later, .Default is always UTF-8.
If you want ISO-8859-1 (Latin-1) you can replace .Default with .GetEncoding(28591) (code page 28591 is ISO-8859-1, which is identical to Windows-1252 except for characters in the range 0x80 to 0x9F). Alternatively use System.Text.Encoding.GetEncoding("iso-8859-1").GetBytes(Str).
If you want UTF-8-encoded bytes, use System.Text.Encoding.UTF8.GetBytes(Str).
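To see why the choice of encoding matters, note that the same text produces different bytes, and therefore different ciphertext, under different encodings. The sketch below is written in C rather than VB.NET, and the function name latin1_to_utf8 is hypothetical. It re-encodes ISO-8859-1 bytes as UTF-8: the single Latin-1 byte E9 (e-acute) becomes the two UTF-8 bytes C3 A9, while plain ASCII bytes are unchanged.

```c
#include <stddef.h>

/* Re-encode ISO-8859-1 (Latin-1) bytes as UTF-8.
   out must have room for 2 * nbytes bytes.
   Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t nbytes, unsigned char *out)
{
    size_t i, n = 0;
    for (i = 0; i < nbytes; i++) {
        if (in[i] < 0x80) {
            /* ASCII range: same single byte in both encodings */
            out[n++] = in[i];
        } else {
            /* 0x80 to 0xFF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}
```

Both parties must agree on the encoding before encryption; otherwise the decrypted bytes will not convert back to the original text.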
C#
In C#, use System.Text.Encoding, which has identical behaviour to the function in VB.NET.
    byte[] abData;
    string Str;
    int i;
    Str = "Hello world!";
    // Convert string to bytes
    abData = System.Text.Encoding.Default.GetBytes(Str);
    for (i = 0; i < abData.Length; i++)
    {
        Console.WriteLine("{0:X}", abData[i]);
    }
    // Convert bytes to string
    Str = System.Text.Encoding.Default.GetString(abData);
    Console.WriteLine("'{0}'", Str);
C/C++
In C and C++, the distinction between a string and an array of bytes is often blurred. A string is a zero-terminated sequence of char types and bytes are stored in the unsigned char type.
A string needs an extra character for the null terminating character; a byte array does not, but it needs its length to be stored in a separate variable.
A byte array can contain a zero (NUL) value but a string cannot.
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    /* Print bytes in hex format + newline */
    static void pr_hexbytes(const unsigned char *bytes, size_t nbytes)
    {
        size_t i;
        for (i = 0; i < nbytes; i++)
            printf("%02X ", bytes[i]);
        printf("\n");
    }

    int main(void)
    {
        char szStr[] = "Hello world!";
        unsigned char *lpData;
        size_t nbytes;
        char *lpszCopy;

        /* Convert string to bytes */
        /* (a) simply re-cast */
        lpData = (unsigned char*)szStr;
        nbytes = strlen(szStr);
        pr_hexbytes(lpData, nbytes);

        /* (b) make a copy (without the terminating zero) */
        lpData = malloc(nbytes);
        if (lpData == NULL)
            return 1;
        memcpy(lpData, szStr, nbytes);
        pr_hexbytes(lpData, nbytes);

        /* Convert bytes to a zero-terminated string */
        lpszCopy = malloc(nbytes + 1);
        if (lpszCopy == NULL) {
            free(lpData);
            return 1;
        }
        memcpy(lpszCopy, lpData, nbytes);
        lpszCopy[nbytes] = '\0';
        printf("'%s'\n", lpszCopy);

        free(lpData);
        free(lpszCopy);
        return 0;
    }
Output:
    48 65 6C 6C 6F 20 77 6F 72 6C 64 21
    48 65 6C 6C 6F 20 77 6F 72 6C 64 21
    'Hello world!'
Whether the plain char type is signed or unsigned depends on your compiler, so char and unsigned char might behave identically on your system, or they might not.
We strongly recommend that you explicitly distinguish between strings and byte arrays in your code by using the correct type and consistently treating them differently.
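One way to keep the two apart in C is to give byte arrays their own type that carries its length. The following is only a sketch: byte_buf, bytes_from_string and string_from_bytes are hypothetical names, not part of any standard library.

```c
#include <stdlib.h>
#include <string.h>

/* A byte array that carries its own length (hypothetical helper type). */
typedef struct {
    unsigned char *data;
    size_t len;
} byte_buf;

/* Copy the characters of a zero-terminated string into a new byte buffer.
   The terminating zero is not copied.
   On allocation failure data is NULL and len is 0. */
byte_buf bytes_from_string(const char *s)
{
    byte_buf b;
    b.len = strlen(s);
    b.data = malloc(b.len ? b.len : 1);
    if (b.data == NULL)
        b.len = 0;
    else
        memcpy(b.data, s, b.len);
    return b;
}

/* Convert bytes back to a newly allocated zero-terminated string.
   Returns NULL if the bytes contain a zero value, because a C string
   cannot represent it, or if allocation fails. */
char *string_from_bytes(const byte_buf *b)
{
    char *s;
    if (b->len > 0 && memchr(b->data, 0, b->len) != NULL)
        return NULL;
    s = malloc(b->len + 1);
    if (s == NULL)
        return NULL;
    if (b->len > 0)
        memcpy(s, b->data, b->len);
    s[b->len] = '\0';
    return s;
}
```

The caller is responsible for freeing both the data member and any string returned. Refusing to convert bytes that contain a zero makes the mistake of treating ciphertext as a string fail visibly instead of silently truncating.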
Wide and multibyte characters in C
The wide character type is used to represent the Unicode character set on Windows and Linux systems.
The wchar_t type should be 2 bytes long on Windows and 4 bytes on Linux, but don't make any assumptions (check sizeof(wchar_t) to find out).
Windows uses UTF-16 encoding and Linux uses UTF-32.
You could just cast directly to a sequence of bytes using (unsigned char*)wcstring and encrypt that.
This will work if the other party has exactly the same setup as you; i.e. the same size wchar_t and endianness.
If not, write a little routine that takes the byte array and converts the word sizes and/or swaps the endianness.
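A routine of this kind might look like the following sketch. Instead of fixing up a cast byte array afterwards, it writes each wide character as four bytes, most significant byte first, so the result does not depend on sizeof(wchar_t) or on the machine's endianness. The function name is hypothetical, and the code assumes each wchar_t holds one complete code point; UTF-16 surrogate pairs, which can occur where wchar_t is 2 bytes, are not combined.

```c
#include <stddef.h>
#include <wchar.h>

/* Write each wide character of ws as 4 bytes, big-endian.
   out must have room for 4 * wcslen(ws) bytes.
   Returns the number of bytes written.
   Assumption: one wchar_t = one code point (no surrogate handling). */
size_t wcs_to_bytes_be32(const wchar_t *ws, unsigned char *out)
{
    size_t i, n = wcslen(ws);
    for (i = 0; i < n; i++) {
        unsigned long c = (unsigned long)ws[i] & 0xFFFFFFFFUL;
        out[4 * i]     = (unsigned char)((c >> 24) & 0xFF);
        out[4 * i + 1] = (unsigned char)((c >> 16) & 0xFF);
        out[4 * i + 2] = (unsigned char)((c >> 8) & 0xFF);
        out[4 * i + 3] = (unsigned char)(c & 0xFF);
    }
    return 4 * n;
}
```

For L"abc" this gives the 12 bytes 00 00 00 61 00 00 00 62 00 00 00 63 on any platform. Both parties still have to agree on this byte layout.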
A multibyte character can be one or more bytes long. ASCII characters are represented by a single byte. Characters from the extended character set are represented by a sequence of two or more bytes. On Windows, MBCS characters are never more than two bytes long and a special lead byte signals that the next byte belongs to the same character; which values are lead bytes depends on the code page. Linux uses whatever is set by the current locale. Casting a multi-byte string to bytes and encrypting that will only work if the other party has the same locale. Our advice is to avoid them for anything except simple ASCII locales.
You can use the wcstombs and mbstowcs functions in stdlib.h to convert between wide character and multi-byte strings.
The equivalent functions in Windows are WideCharToMultiByte and MultiByteToWideChar.
Here is a C program that demonstrates converting wide character strings, with the results of running it on a Windows and an x86 Linux machine.
    /* wchar_tests.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    int main(void)
    {
        size_t nchars, nbytes, i;
        const unsigned char *ptr;
        char *mbstring;
        const wchar_t *wcstring;

        /* Make and print a wide-character string */
        wcstring = L"abc";
        wprintf(L"%ls\n", wcstring);
        /* How long is it? */
        nchars = wcslen(wcstring);
        wprintf(L"%lu characters\n", (unsigned long)nchars);
        /* How big is a wchar_t? */
        wprintf(L"sizeof(wchar_t)=%lu\n", (unsigned long)sizeof(wchar_t));

        /* Cast to array of bytes and print */
        ptr = (const unsigned char*)wcstring;
        nbytes = nchars * sizeof(wchar_t);
        wprintf(L"%lu bytes\n", (unsigned long)nbytes);
        for (i = 0; i < nbytes; i++) {
            wprintf(L"%02X ", ptr[i]);
        }
        wprintf(L"\n");

        /* Convert to multi-byte string */
        nbytes = wcstombs(NULL, wcstring, 0);
        if (nbytes == (size_t)-1)
            return 1;   /* not representable in the current locale */
        wprintf(L"%lu bytes in multi-byte string\n", (unsigned long)nbytes);
        mbstring = malloc(nbytes+1);
        if (mbstring == NULL)
            return 1;
        nbytes = wcstombs(mbstring, wcstring, nbytes+1);

        /* Cast to array of bytes and print */
        ptr = (const unsigned char*)mbstring;
        for (i = 0; i < nbytes; i++) {
            wprintf(L"%02X ", ptr[i]);
        }
        wprintf(L"\n");

        free(mbstring);
        return 0;
    }
The output on Windows
    abc
    3 characters
    sizeof(wchar_t)=2
    6 bytes
    61 00 62 00 63 00
    3 bytes in multi-byte string
    61 62 63
The output on Linux
    abc
    3 characters
    sizeof(wchar_t)=4
    12 bytes
    61 00 00 00 62 00 00 00 63 00 00 00
    3 bytes in multi-byte string
    61 62 63
Input to Encryption and Decryption Processes
The input to an encryption process must be 'binary' data, i.e. a 'bit string'. We need to convert the text we want to encrypt into 'binary' format first and then encrypt it. The results of encryption are always binary. Do not attempt to treat raw ciphertext as 'text' or put it directly into a String type.
Store ciphertext either as a raw binary file or convert it to base64 or hexadecimal format. You can safely put data in base64 or hexadecimal format in a String.
When you decrypt, always start with binary data, decrypt to binary data, and then and only then, convert back to text, if that is what you are expecting. You can devise your own checks to make sure the decrypted ciphertext is what you expect before you do the final conversion.
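As an illustration of such a check, the sketch below tests whether decrypted bytes look like plain US-ASCII text before they are converted to a string. The function name and the accepted byte range are our own assumptions; if you expect UTF-8 or another encoding you would need a different test.

```c
#include <stddef.h>

/* Return 1 if every byte is printable US-ASCII (0x20 to 0x7E) or one of
   tab, line feed, carriage return; otherwise return 0.
   Hypothetical sanity check; adapt the accepted range to your own data. */
int looks_like_ascii_text(const unsigned char *data, size_t nbytes)
{
    size_t i;
    for (i = 0; i < nbytes; i++) {
        unsigned char b = data[i];
        if (b == 0x09 || b == 0x0A || b == 0x0D)
            continue;
        if (b < 0x20 || b > 0x7E)
            return 0;
    }
    return 1;
}
```

A check like this only catches obviously wrong output, for example after decrypting with the wrong key. It is not a substitute for a proper integrity check on the ciphertext.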
On a US-English system set up for ANSI characters, you can probably get away with using a String type to carry out 'binary' operations. We know, we have done it for years and a lot of code on this site still has residual mistakes in it. We spend a lot of time explaining to people why their code doesn't work properly on their Chinese/Japanese/Korean/Hebrew system.
Other information
See also our pages on:
- Cross-Platform Encryption
- Encryption with International Character Sets
- Binary and byte operations in Visual Basic
- Using Byte Arrays in Visual Basic
- Encrypting variable-length strings with a password.
This page last updated 20 January 2024.