How to convert VBA/VB6 Unicode strings to UTF-8


VBA/VB6 stores its strings internally in what Microsoft documentation used to call "Unicode" but which is more accurately called UTF-16. Each character in the Basic Multilingual Plane is stored in two bytes; characters outside it (for example emoji and some rare CJK ideographs) are stored as a surrogate pair occupying four bytes.
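To make the storage model concrete, here is a minimal sketch that compares the character count (Len) with the byte count (LenB) and dumps the raw UTF-16 bytes of a string. The procedure name ShowUtf16Storage is an illustrative assumption, not part of any library.

```vba
' Illustrative sketch: inspect how a VBA/VB6 string is stored in memory
Public Sub ShowUtf16Storage()
    Dim s As String
    Dim ab() As Byte
    Dim i As Long

    s = "abc"
    Debug.Print Len(s)     ' 3 characters
    Debug.Print LenB(s)    ' 6 bytes: two per character

    ' Assigning a String to a Byte array copies the raw UTF-16 (little-endian) bytes
    ab = s
    For i = 0 To UBound(ab)
        Debug.Print Right$("0" & Hex$(ab(i)), 2) & " ";   ' 61 00 62 00 63 00
    Next
    Debug.Print

    ' One character outside the Basic Multilingual Plane (U+1F600),
    ' built from its two surrogate code units
    s = ChrW$(&HD83D) & ChrW$(&HDE00)
    Debug.Print Len(s)     ' 2 code units for a single character
    Debug.Print LenB(s)    ' 4 bytes
End Sub
```

Note that Len counts 16-bit code units, not characters, which is why the last string reports a length of 2.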

This page explains how to pass such a string to a cryptographic operation that requires its input to be encoded in UTF-8.

Important

The first rule is that you should always convert VB strings to a byte array before trying to do cryptographic operations.

Simple ASCII strings

A simple ASCII string can be converted to a byte array using the built-in StrConv() function:

Dim abData() As Byte
abData = StrConv(strInput, vbFromUnicode)

This stores the ASCII characters one per byte in the byte array abData. Because ASCII is a subset of UTF-8 this array is also UTF-8 encoded.

For example the 3-character ASCII string "abc" is represented by the three bytes 0x61 0x62 0x63.

Dim abData() As Byte
abData = StrConv("abc", vbFromUnicode)

Dim i As Integer
For i = 0 To UBound(abData)
    Debug.Print Hex(abData(i)) & " ";
Next

This should output:

61 62 63 
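Since the StrConv shortcut is only safe for pure ASCII input, it can help to check the string first. The following is a minimal sketch of such a guard; the function name IsPureAscii is a hypothetical helper introduced here for illustration.

```vba
' Hypothetical helper: True if every character in strInput is in the ASCII range U+0000..U+007F
Public Function IsPureAscii(strInput As String) As Boolean
    Dim i As Long
    For i = 1 To Len(strInput)
        ' AscW returns a signed Integer, so mask it to get the code unit as 0..65535
        If (AscW(Mid$(strInput, i, 1)) And &HFFFF&) > &H7F& Then
            Exit Function   ' returns False
        End If
    Next
    IsPureAscii = True
End Function

Public Sub DemoIsPureAscii()
    Debug.Print IsPureAscii("abc123")              ' True
    Debug.Print IsPureAscii("a" & ChrW$(&HE1))     ' False (contains U+00E1)
End Sub
```

An empty string is reported as pure ASCII, because there is no character in it that falls outside the range.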

Problems with StrConv

If you pass a string with, say, an accented Latin character like á (U+00E1), the StrConv function will convert it using the system's default ANSI code page. On a Western European system this is Windows-1252, which agrees with Latin-1 (ISO-8859-1) for this character, so the result is just the one byte 0xE1. This result is not UTF-8 encoded (it should be the two bytes 0xC3 0xA1).

Furthermore, if you pass a character that has no equivalent in the system's ANSI code page, say a Chinese character on a Western system, StrConv will silently fail and just output the character as a question mark '?' (U+003F).
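The two failure modes can be observed with a short test like the sketch below. The procedure name is an illustrative assumption, and the bytes printed depend on the system's ANSI code page: on a Windows-1252 system you would expect one byte E1 for the first string and one byte 3F for the second, but other code pages will give different results.

```vba
' Illustrative sketch: show what StrConv produces for non-ASCII input.
' The output depends on the system's ANSI code page.
Public Sub DemoStrConvProblems()
    Dim abData() As Byte

    ' LATIN SMALL LETTER A WITH ACUTE (U+00E1), built with ChrW$ so the source stays ASCII
    abData = StrConv(ChrW$(&HE1), vbFromUnicode)
    Debug.Print "U+00E1:"; UBound(abData) + 1; "byte(s), first byte = "; Hex$(abData(0))

    ' Han character (U+672C), which has no equivalent in a Western code page
    abData = StrConv(ChrW$(&H672C), vbFromUnicode)
    Debug.Print "U+672C:"; UBound(abData) + 1; "byte(s), first byte = "; Hex$(abData(0))
End Sub
```

Neither result is the UTF-8 encoding of the character, so neither should be passed to a function that expects UTF-8.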

Converting to UTF-8

In a VBA/VB6 application, use the following code to convert a "Unicode" string to an array of bytes encoded in UTF-8.

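''' WinApi function that maps a UTF-16 (wide character) string to a new character string
''' (VBA7, i.e. Office 2010 and later, needs PtrSafe and LongPtr for pointer arguments)
#If VBA7 Then
Private Declare PtrSafe Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpWideCharStr As LongPtr, _
    ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As LongPtr, _
    ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As LongPtr, _
    ByVal lpUsedDefaultChar As LongPtr) As Long
#Else
Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, _
    ByVal lpUsedDefaultChar As Long) As Long
#End If

' CodePage constant for UTF-8
Private Const CP_UTF8 As Long = 65001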

''' Return byte array with VBA "Unicode" string encoded in UTF-8
Public Function Utf8BytesFromString(strInput As String) As Byte()
    Dim nBytes As Long
    Dim abBuffer() As Byte
    ' Empty or null input: return a zero-length array (UBound = -1)
    If Len(strInput) = 0 Then
        abBuffer = ""
        Utf8BytesFromString = abBuffer
        Exit Function
    End If
    ' Get required length in bytes. Passing the explicit character count
    ' (instead of -1) means no terminating null is counted or written.
    nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(strInput), Len(strInput), 0&, 0&, 0&, 0&)
    If nBytes = 0 Then Err.Raise vbObjectError + 513, "Utf8BytesFromString", "WideCharToMultiByte failed"
    ReDim abBuffer(nBytes - 1)  ' NB upper bound is one less than the number of bytes
    nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(strInput), Len(strInput), ByVal VarPtr(abBuffer(0)), nBytes, 0&, 0&)
    Utf8BytesFromString = abBuffer
End Function
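As a quick check, the function can be called from a test procedure like the sketch below, placed in the same project. HexFromBytes and DemoUtf8BytesFromString are hypothetical names introduced here for illustration; only Utf8BytesFromString comes from the code above.

```vba
' Hypothetical helper: format a byte array as space-separated two-digit hex values
Private Function HexFromBytes(abData() As Byte) As String
    Dim i As Long
    Dim s As String
    For i = LBound(abData) To UBound(abData)
        s = s & Right$("0" & Hex$(abData(i)), 2) & " "
    Next
    HexFromBytes = RTrim$(s)
End Function

Public Sub DemoUtf8BytesFromString()
    Dim abUtf8() As Byte
    ' "a" followed by a-acute (U+00E1), built with ChrW$ so the source code stays ASCII
    abUtf8 = Utf8BytesFromString("a" & ChrW$(&HE1))
    Debug.Print HexFromBytes(abUtf8)                    ' 61 C3 A1
    Debug.Print UBound(abUtf8) - LBound(abUtf8) + 1     ' 3 bytes for 2 characters
End Sub
```

The byte array abUtf8, not the original string, is what should be passed on to the cryptographic function.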

Examples

The Excel spreadsheet utf8-tests.xls (zipped, 20kB) has samples of international characters in Spanish, Japanese, Chinese and Hebrew, and some ASCII characters. It contains two Visual Basic code modules: UtfTests.bas, which carries out tests on the strings, and basUtf8FromString.bas, which provides the Utf8BytesFromString() function used by those tests.

The results of running the tests are shown here.

Results

String      # characters  # bytes UTF-8  UTF-8 bytes                                                                 Note
"abc123"    6             6              61 62 63 31 32 33                                                           1
"áéíóñ"     5             10             C3 A1 C3 A9 C3 AD C3 B3 C3 B1                                               2
Japanese    5             15             E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF                                3
Chinese     15            25             4F 55 3D E7 B8 BD E5 B1 80 2C 43 3D E4 B8 AD E5 9C 8B 2C 43 4E 3D E6 9C AC  4
Hebrew      12            15             61 62 63 20 D7 9B D7 A9 D7 A8 20 66 31 32 33                                5

Notes

  1. Each ASCII character is encoded in one byte, e.g.
    LATIN SMALL LETTER A (U+0061)  => 61
    DIGIT THREE (U+0033)           => 33
    
  2. The accented Latin characters will print in the VB immediate window and are encoded in two bytes, e.g.
    LATIN SMALL LETTER A WITH ACUTE (U+00E1) => C3 A1
    LATIN SMALL LETTER N WITH TILDE (U+00F1) => C3 B1
    
  3. The Japanese Hiragana characters print as '?' and are encoded in three bytes, e.g.
    HIRAGANA LETTER KO (U+3053) => E3 81 93
    
  4. The Chinese characters print as '?' and are encoded in three bytes, e.g.
    Han character 'ben' (U+672C) => E6 9C AC
    
  5. The Hebrew characters print as '?' and are encoded in two bytes. Read from left to right on screen, the characters appear as RESH-SHIN-KAF, but they are stored in logical (reading) order, which for Hebrew runs right to left:
    HEBREW LETTER KAF  (U+05DB) => D7 9B
    HEBREW LETTER SHIN (U+05E9) => D7 A9
    HEBREW LETTER RESH (U+05E8) => D7 A8
    
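The byte values in the notes follow directly from the UTF-8 bit layout: code points below U+0080 take one byte, those below U+0800 take two, and the rest of the Basic Multilingual Plane takes three. The sketch below works through that arithmetic in plain VBA as an illustration. It is not the method used by the spreadsheet, the names Utf8HexFromCodePoint and HexByte are hypothetical, and it makes no attempt to handle surrogates or code points above U+FFFF.

```vba
' Hypothetical helper: two-digit hex representation of a value 0..255
Private Function HexByte(ByVal b As Long) As String
    HexByte = Right$("0" & Hex$(b), 2)
End Function

' Illustrative sketch: UTF-8 encoding of one BMP code point (0..&HFFFF) as a hex string.
' Assumption: the caller does not pass a surrogate (U+D800..U+DFFF); this is not checked.
Public Function Utf8HexFromCodePoint(ByVal cp As Long) As String
    If cp < 0 Or cp > &HFFFF& Then Err.Raise 5   ' invalid procedure call or argument
    If cp < &H80& Then
        ' 0xxxxxxx
        Utf8HexFromCodePoint = HexByte(cp)
    ElseIf cp < &H800& Then
        ' 110xxxxx 10xxxxxx
        Utf8HexFromCodePoint = HexByte(&HC0& Or (cp \ &H40&)) & " " & _
                               HexByte(&H80& Or (cp And &H3F&))
    Else
        ' 1110xxxx 10xxxxxx 10xxxxxx
        Utf8HexFromCodePoint = HexByte(&HE0& Or (cp \ &H1000&)) & " " & _
                               HexByte(&H80& Or ((cp \ &H40&) And &H3F&)) & " " & _
                               HexByte(&H80& Or (cp And &H3F&))
    End If
End Function

Public Sub DemoUtf8HexFromCodePoint()
    Debug.Print Utf8HexFromCodePoint(&H61&)     ' 61
    Debug.Print Utf8HexFromCodePoint(&HE1&)     ' C3 A1
    Debug.Print Utf8HexFromCodePoint(&H5DB&)    ' D7 9B
    Debug.Print Utf8HexFromCodePoint(&H3053&)   ' E3 81 93
    Debug.Print Utf8HexFromCodePoint(&H672C&)   ' E6 9C AC
End Sub
```

The results agree with the examples given in the notes for the ASCII, Latin, Hebrew, Hiragana and Han characters.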

This page last updated 15 February 2016