How to convert VBA/VB6 Unicode strings to UTF-8
VBA/VB6 stores its strings internally in what Microsoft documentation used to call "Unicode" but should more accurately be called UTF-16. This means that each character is stored in two bytes (well, actually, some obscure characters can use more).
This page explains how to pass the information to a cryptographic operation that requires the string to be encoded in UTF-8.
The first rule is that you should always convert VB strings to a byte array before trying to do cryptographic operations.
Simple ASCII strings
A simple ASCII string can be converted to a byte array using the internal
Dim abData() As Byte abData = StrConv(strInput, vbFromUnicode)
This stores the ASCII characters one per byte in the byte array
abData. Because ASCII is a subset of UTF-8
this array is also UTF-8 encoded.
For example the 3-character ASCII string
"abc" is represented by the three bytes
0x61 0x62 0x63.
Dim abData() As Byte abData = StrConv("abc", vbFromUnicode) Dim i As Integer For i = 0 To UBound(abData) Debug.Print Hex(abData(i)) & " "; Nextshould output
61 62 63
Problems with StrConv
If you pass a string with, say, an accented Latin character like
á (U+00E1) the StrConv function will
convert it using Latin-1 encoding (ISO-8859-1) to just the one byte
This result is not UTF-8 encoded (it should be the two bytes
Furthermore, if you pass, say, a Chinese character which requires more than one byte to store in UTF-16, StrConv will silently fail and just output the character as a question mark '?' (U+003F).
Converting to UTF-8
In a VBA/VB6 application, use the following code to convert a "Unicode" string to an array of bytes encoded in UTF-8.
''' WinApi function that maps a UTF-16 (wide character) string to a new character string Private Declare Function WideCharToMultiByte Lib "kernel32" ( _ ByVal CodePage As Long, _ ByVal dwFlags As Long, _ ByVal lpWideCharStr As Long, _ ByVal cchWideChar As Long, _ ByVal lpMultiByteStr As Long, _ ByVal cbMultiByte As Long, _ ByVal lpDefaultChar As Long, _ ByVal lpUsedDefaultChar As Long) As Long ' CodePage constant for UTF-8 Private Const CP_UTF8 = 65001 ''' Return byte array with VBA "Unicode" string encoded in UTF-8 Public Function Utf8BytesFromString(strInput As String) As Byte() Dim nBytes As Long Dim abBuffer() As Byte ' Get length in bytes *including* terminating null nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(strInput), -1, vbNull, 0&, 0&, 0&) ' We don't want the terminating null in our byte array, so ask for `nBytes-1` bytes ReDim abBuffer(nBytes - 2) ' NB ReDim with one less byte than you need nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(strInput), -1, ByVal VarPtr(abBuffer(0)), nBytes - 1, 0&, 0&) Utf8BytesFromString = abBuffer End Function
The Excel spreadsheet utf8-tests.xls (zipped, 20kB) has samples of international characters in Spanish, Japanese, Chinese and Hebrew, and some ASCII characters. It contains two Visual basic code modules UtfTests.bas which carries out tests on the strings using the Utf8BytesFromString() function in the module basUtf8FromString.bas.
The results of running the tests are shown here.
|String||# characters||# bytes UTF-8||UTF-8 bytes||Note|
|"abc123"||6||6||61 62 63 31 32 33||1|
|"áéíóñ"||5||10||C3 A1 C3 A9 C3 AD C3 B3 C3 B1||2|
|Japanese||5||15||E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF||3|
|Chinese||15||25||4F 55 3D E7 B8 BD E5 B1 80 2C 43 3D E4 B8 AD E5 9C 8B 2C 43 4E 3D E6 9C AC||4|
|Hebrew||12||15||61 62 63 20 D7 9B D7 A9 D7 A8 20 66 31 32 33||5|
- Each ASCII character is encoded in one byte, e.g.
LATIN SMALL LETTER A (U+0061) => 61 DIGIT THREE (U+0033) => 33
- The accented latin characters will print in the VB immediate window and are encoded in two bytes, e.g.
LATIN SMALL LETTER A WITH ACUTE (U+00E1) => C3 A1 LATIN SMALL LETTER N WITH TILDE (U+00F1) => C3 B1
- The Japanese Hiragana characters print as '?' and are encoded in three bytes, e.g.
HIRAGANA LETTER KO (U+3053) => E3 81 93
- The Chinese characters print as '?' and are encoded in three bytes, e.g.
Han character 'ben' (U+672C) => E6 9C AC
- The Hebrew characters print as '?' and are encoded in two bytes.
The characters are displayed left-to-right as
RESH-SHIN-KAFbut are stored in the correct right-to-left order:
HEBREW LETTER KAF (U+05DB) => D7 9B HEBREW LETTER SHIN (U+05E9) => D7 A9 HEBREW LETTER RESH (U+05E8) => D7 A8
- Cryptography with International Character Sets
- Cross-Platform Encryption
- Storing and representing ciphertext
- Using Byte Arrays in Visual Basic VB6
- Binary and byte operations in Visual Basic VB6
For more information or to comment on this page, please send us a message.
This page last updated 15 February 2016