David Bernheisel
Convert UTF-32/16 with BOMs to Latin1
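UTF-32 and UTF-16 binaries can begin with a byte order mark (BOM), which complicates converting them to Latin-1 or UTF-8: the BOM survives as a stray U+FEFF in UTF-8 and makes the Latin-1 conversion fail. Here's how to trim the BOM off the binary.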
# Construct our UTF-32 string with a BOM
iex> utf32_with_bom = <<0x00, 0x00, 0xFE, 0xFF>> <> :unicode.characters_to_binary("foo", :utf8, :utf32)
<<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>
# Convert it to UTF-8, see the BOM?
iex> :unicode.characters_to_binary(utf32_with_bom, :utf32, :utf8)
"\uFEFFfoo"
# Try to convert to Latin-1. Notice the error: U+FEFF has no Latin-1 representation
iex> :unicode.characters_to_binary(utf32_with_bom, :utf32, :latin1)
{:error, "", <<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>}
# Get the BOM byte size
iex> {_encoding, bom} = :unicode.bom_to_encoding(utf32_with_bom)
{{:utf32, :big}, 4}
# Pattern-match off the BOM; `utf32` is bound to the trimmed binary (iex echoes the whole right-hand side)
iex> <<_skip::bytes-size(bom), utf32::binary>> = utf32_with_bom
<<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>
# Now convert to Latin-1
iex> :unicode.characters_to_binary(utf32, :utf32, :latin1)
"foo"
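The steps above can be folded into one small helper. This is a minimal sketch, not part of the original post: the module name `BomTrim`, the function `to_latin1/2`, and the `fallback_encoding` argument are hypothetical names chosen for illustration. It relies on `:unicode.bom_to_encoding/1` reporting a BOM size of 0 when no BOM is present, in which case the caller-supplied fallback encoding is assumed.

```elixir
defmodule BomTrim do
  # Hypothetical helper: strips a leading BOM (if any) and converts the
  # remainder to Latin-1. `fallback_encoding` is an assumption used only
  # when the input carries no BOM.
  def to_latin1(binary, fallback_encoding \\ :utf8) when is_binary(binary) do
    {encoding, rest} =
      case :unicode.bom_to_encoding(binary) do
        # No BOM found: keep the input as-is and trust the fallback.
        {_no_bom, 0} ->
          {fallback_encoding, binary}

        # BOM found: skip its bytes and use the encoding it announces.
        {detected, bom_size} ->
          <<_bom::binary-size(bom_size), rest::binary>> = binary
          {detected, rest}
      end

    # Returns the converted binary, or an {:error, _, _} / {:incomplete, _, _}
    # tuple if the text cannot be represented in Latin-1.
    :unicode.characters_to_binary(rest, encoding, :latin1)
  end
end
```

With this in place, `BomTrim.to_latin1(utf32_with_bom)` should return `"foo"`, and the same call works for a UTF-16 binary such as `<<0xFF, 0xFE>> <> :unicode.characters_to_binary("foo", :utf8, {:utf16, :little})`, since the encoding is taken from the BOM rather than hard-coded.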