
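UTF-16 and UTF-32 binaries can begin with a byte order mark (BOM), which complicates converting them to ASCII or UTF-8: the BOM survives as U+FEFF in UTF-8 and has no ASCII or Latin-1 equivalent, so that conversion fails. Here's how to trim the BOM off the binary.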
# Construct our UTF-32 string with a BOM

iex> utf32_with_bom = <<0x00, 0x00, 0xFE, 0xFF>> <> :unicode.characters_to_binary("foo", :utf8, :utf32)
<<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>

# Convert it to UTF-8. Notice the BOM survives as \uFEFF

iex> :unicode.characters_to_binary(utf32_with_bom, :utf32, :utf8)
"\uFEFFfoo"

# Try to convert to ASCII (via :latin1, a superset of ASCII). Notice the error

iex> :unicode.characters_to_binary(utf32_with_bom, :utf32, :latin1)
{:error, "", <<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>}

# Get the BOM byte size

iex> {_encoding, bom} = :unicode.bom_to_encoding(utf32_with_bom)
{{:utf32, :big}, 4}

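# Pattern-match to skip the BOM and bind the trimmed UTF-32 binary.
# The match returns the whole right-hand side, so the output still shows the BOM.

iex> <<_skip::bytes-size(bom), utf32::binary>> = utf32_with_bom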
<<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>

# Now convert the trimmed binary to ASCII

iex> :unicode.characters_to_binary(utf32, :utf32, :latin1)
"foo"
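
The steps above can be folded into a small helper. This is a minimal sketch, not part of the original tip: the module name `BomTrim` and the functions `strip_bom/1` and `convert/2` are hypothetical. It relies only on `:unicode.bom_to_encoding/1` and `:unicode.characters_to_binary/3`, and it passes the detected encoding straight through, so it should also cover UTF-16 and little-endian input. When no BOM is present, `bom_to_encoding/1` reports `{:latin1, 0}`, so the sketch treats such input as Latin-1, which is an assumption that may not suit your data.

```elixir
defmodule BomTrim do
  # Hypothetical helper module, for illustration only.

  # Splits a binary into {detected_encoding, binary_without_bom}.
  # :unicode.bom_to_encoding/1 returns {encoding, bom_byte_size};
  # with no BOM it returns {:latin1, 0}, so nothing is trimmed.
  def strip_bom(binary) when is_binary(binary) do
    {encoding, bom_size} = :unicode.bom_to_encoding(binary)
    <<_bom::binary-size(bom_size), rest::binary>> = binary
    {encoding, rest}
  end

  # Trims the BOM, then converts from the detected encoding to
  # out_encoding (for example :latin1 or :utf8).
  def convert(binary, out_encoding) when is_binary(binary) do
    {encoding, rest} = strip_bom(binary)
    :unicode.characters_to_binary(rest, encoding, out_encoding)
  end
end
```

With the binary from the walkthrough, `BomTrim.convert(utf32_with_bom, :latin1)` returns `"foo"`. Like the underlying `:unicode` call, `convert/2` returns an `{:error, ...}` or `{:incomplete, ...}` tuple when the input cannot be converted, so callers may want to match on the result.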

© 2021 Zest Creative, LLC