DI Management Home > Cryptography > How to convert VBA/VB6 Unicode strings to UTF-8

How to convert VBA/VB6 Unicode strings to UTF-8


VBA/VB6 stores its strings internally in what Microsoft documentation used to call "Unicode" but should more accurately be called UTF-16. This means that each character is stored in two bytes (well, actually, some obscure characters can use more).

This page explains how to pass the information to a cryptographic operation that requires the string to be encoded in UTF-8.

New 2019-12-112019-12-11: Added VBA code basFileString.bas to read and write binary files.

New 2018-08-172018-08-17: Added changes required to run on 64-bit Office.

New 2018-08-152018-08-15: Added function Utf8BytesToString() to do the reverse and convert from UTF-8-encoded byte array to a VB string.

Go to Downloads.

Important

Rule 1. Always convert VB strings to a byte array before trying to do cryptographic operations.

Simple ASCII strings

A simple ASCII string can be converted to a byte array using the internal StrConv() function

Dim abData() As Byte
abData = StrConv(strInput, vbFromUnicode)

This stores the ASCII characters one per byte in the byte array abData. Because ASCII is a subset of UTF-8 this array is also UTF-8 encoded.

For example the 3-character ASCII string "abc" is represented by the three bytes 0x61 0x62 0x63.

Dim abData() As Byte
abData = StrConv("abc", vbFromUnicode)

Dim i As Integer
For i = 0 To UBound(abData)
    Debug.Print Hex(abData(i)) & " ";
Next
should output
61 62 63 

Problems with StrConv

If you pass a string with, say, an accented Latin character like á (U+00E1) the StrConv function will convert it using Latin-1 encoding (ISO-8859-1) to just the one byte 0xE1. This result is not UTF-8 encoded (it should be the two bytes 0xC3 0xA1).

Furthermore, if you pass, say, a Chinese character which requires more than one byte to store in UTF-16, StrConv will silently fail and just output the character as a question mark '?' (U+003F).

Converting to UTF-8

In a VBA/VB6 application, use the following code to convert a "Unicode" string to an array of bytes encoded in UTF-8.

''' WinApi function that maps a UTF-16 (wide character) string to a new character string
Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, _
    ByVal lpUsedDefaultChar As Long) As Long
    
' CodePage constant for UTF-8
Private Const CP_UTF8 = 65001

''' Return byte array with VBA "Unicode" string encoded in UTF-8
Public Function Utf8BytesFromString(strInput As String) As Byte()
    Dim nBytes As Long
    Dim abBuffer() As Byte
    ' Catch empty or null input string
    Utf8BytesFromString = vbNullString
    If Len(strInput) < 1 Then Exit Function
    ' Get length in bytes *including* terminating null
    nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(strInput), -1, 0&, 0&, 0&, 0&)
    ' We don't want the terminating null in our byte array, so ask for `nBytes-1` bytes
    ReDim abBuffer(nBytes - 2)  ' NB ReDim with one less byte than you need
    nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(strInput), -1, ByVal VarPtr(abBuffer(0)), nBytes - 1, 0&, 0&)
    Utf8BytesFromString = abBuffer
End Function

2018-07-27: Thanks to Guillermo R. Flook for pointing out that the original code did not accept an empty or null string and for suggesting a correction.
2018-08-15: Added Utf8BytesToString() function to do the reverse and convert from UTF-8-encoded byte array to a VB string.
2018-11-06: Replaced "vbNull" in 5th argument 7th line in function above with "0&".

Examples

The Excel spreadsheet utf8-tests.xls (zipped, 18.1 kB) has samples of international characters in Spanish, Japanese, Chinese and Hebrew, and some ASCII characters. It contains two Visual basic code modules UtfTests.bas which carries out tests on the strings using the Utf8BytesFromString() function in the module basUtf8FromString.bas (see Downloads).

utf8-tests

The results of running the tests are shown here.

Results

String# characters# bytes UTF-8UTF-8 bytesNote
"abc123"6661 62 63 31 32 331
"áéíóñ"510C3 A1 C3 A9 C3 AD C3 B3 C3 B12
Japanese515E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF3
Chinese15254F 55 3D E7 B8 BD E5 B1 80 2C 43 3D E4 B8 AD E5 9C 8B 2C 43 4E 3D E6 9C AC4
Hebrew121561 62 63 20 D7 9B D7 A9 D7 A8 20 66 31 32 335

Notes

  1. Each ASCII character is encoded in one byte, e.g.
    LATIN SMALL LETTER A (U+0061)  => 61
    DIGIT THREE (U+0033)           => 33
    
  2. The accented latin characters will print in the VB immediate window and are encoded in two bytes, e.g.
    LATIN SMALL LETTER A WITH ACUTE (U+00E1) => C3 A1
    LATIN SMALL LETTER N WITH TILDE (U+00F1) => C3 B1
    
  3. The Japanese Hiragana characters print as '?' and are encoded in three bytes, e.g.
    HIRAGANA LETTER KO (U+3053) => E3 81 93
    
  4. The Chinese characters print as '?' and are encoded in three bytes, e.g.
    Han character 'ben' (U+672C) => E6 9C AC
    
  5. The Hebrew characters print as '?' and are encoded in two bytes. The characters are displayed left-to-right as RESH-SHIN-KAF but are stored in the correct right-to-left order:
    HEBREW LETTER KAF  (U+05DB) => D7 9B
    HEBREW LETTER SHIN (U+05E9) => D7 A9
    HEBREW LETTER RESH (U+05E8) => D7 A8
    

Converting from UTF-8-encoded byte array to a VB string

Use the Utf8BytesToString() function to do the reverse.

''' Maps a character string to a UTF-16 (wide character) string
Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cchMultiByte As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long _
    ) As Long

' CodePage constant for UTF-8
Private Const CP_UTF8 = 65001

''' Return length of byte array or zero if uninitialized
Private Function BytesLength(abBytes() As Byte) As Long
    ' Trap error if array is uninitialized
    On Error Resume Next
    BytesLength = UBound(abBytes) - LBound(abBytes) + 1
End Function

''' Return VBA "Unicode" string from byte array encoded in UTF-8
Public Function Utf8BytesToString(abUtf8Array() As Byte) As String
    Dim nBytes As Long
    Dim nChars As Long
    Dim strOut As String
    Utf8BytesToString = ""
    ' Catch uninitialized input array
    nBytes = BytesLength(abUtf8Array)
    If nBytes <= 0 Then Exit Function
    ' Get number of characters in output string
    nChars = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(abUtf8Array(0)), nBytes, 0&, 0&)
    ' Dimension output buffer to receive string
    strOut = String(nChars, 0)
    nChars = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(abUtf8Array(0)), nBytes, StrPtr(strOut), nChars)
    Utf8BytesToString = Left$(strOut, nChars)
End Function

Here's a quick test for some accented Spanish characters. You can just type these into your code.

Public Sub Test_Utf8String()
    Dim b() As Byte
    Dim s As String
    b = Utf8BytesFromString("áéíóñ")
    s = Utf8BytesToString(b)
    Debug.Print "[" & s & "]"
End Sub
This should result in the output
[áéíóñ]

Changes required to run on 64-bit Office

The only changes required to run on a 64-bit version of Office are in the WinApi declarations and are as follows:

Private Declare PtrSafe Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpWideCharStr As LongPtr, _
    ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As LongPtr, _
    ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, _
    ByVal lpUsedDefaultChar As Long) As Long

Private Declare PtrSafe Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As LongPtr, _
    ByVal cchMultiByte As Long, _
    ByVal lpWideCharStr As LongPtr, _
    ByVal cchWideChar As Long _
    ) As Long
You can use the VBA7 compiler constant to write code that will work on both 32-bit and 64-bit versions of Office.

See the updated code basUtf8FromString64.bas.

Note that this may display one set of code in red in your IDE:
#If VBA7 Then
Private Declare PtrSafe Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpWideCharStr As LongPtr, _
    ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As LongPtr, _
    ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, _
    ByVal lpUsedDefaultChar As Long _
    ) As Long
#Else 
Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, _
    ByVal lpUsedDefaultChar As Long _
    ) As Long 
#End If
but this will be ignored by the compiler when it runs. It just looks unsightly.

Note also that we are referring to the 64-bit version of Office, not the Windows platform you are running on. You do not need the Win64 complier constant and you definitely do not use the LongLong data type in your code.

See also the Microsoft page 64-Bit Visual Basic for Applications Overview.

Downloads

basUtf8FromString64.bas (zipped, last updated 2017-08-17)
basFileString.bas (zipped, last updated 2019-12-11)

Related Topics

See also our pages on:

Contact

For more information or to comment on this page, please send us a message.

This page first published 30 June 2015. Last updated 6 April 2021