Elixir Tips

  • UTF-16 and UTF-32 binaries can start with a byte-order mark (BOM), which complicates converting them to Latin-1 (ASCII) or UTF-8. Here's how to trim the BOM off the binary.
    # Construct our UTF-32 string with a BOM
    
    iex> utf32_with_bom = <<0x00, 0x00, 0xFE, 0xFF>> <> :unicode.characters_to_binary("foo", :utf8, :utf32)
    <<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>
    
    # Convert it to UTF-8. See the BOM? It survives as \uFEFF
    
    iex> :unicode.characters_to_binary(utf32_with_bom, :utf32, :utf8)
    "\uFEFFfoo"
    
    # Try to convert to Latin-1 (a superset of ASCII). Notice the error: U+FEFF has no Latin-1 representation
    
    iex> :unicode.characters_to_binary(utf32_with_bom, :utf32, :latin1)
    {:error, "", <<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>}
    
    # Get the BOM byte size
    
    iex> {_encoding, bom} = :unicode.bom_to_encoding(utf32_with_bom)
    {{:utf32, :big}, 4}
    
    # Pattern-match to skip the BOM bytes and bind the rest of the UTF-32 binary
    
    iex> <<_skip::bytes-size(bom), utf32::binary>> = utf32_with_bom
    <<0, 0, 254, 255, 0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>
    
    # Now convert to Latin-1 (ASCII-compatible)
    
    iex> :unicode.characters_to_binary(utf32, :utf32, :latin1)
    "foo"
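The iex steps above could be wrapped in a small helper. This is only a sketch: the module name `BomTrim` and its functions are hypothetical, not part of Elixir or OTP. It relies on `:unicode.bom_to_encoding/1` reporting a BOM length of 0 when no BOM is found, and in that case it falls back to an encoding supplied by the caller (assumed here to default to `:utf8`).

```elixir
defmodule BomTrim do
  # Hypothetical helper module; names are illustrative only.

  # Returns {encoding, rest}, where `rest` is `binary` without its BOM.
  # When no BOM is present, :unicode.bom_to_encoding/1 reports a length of 0
  # (with :latin1 as a placeholder encoding), so we return the caller-supplied
  # `default` encoding and the binary unchanged.
  def strip_bom(binary, default \\ :utf8) when is_binary(binary) do
    case :unicode.bom_to_encoding(binary) do
      {_placeholder, 0} ->
        {default, binary}

      {encoding, bom_size} ->
        # Skip exactly bom_size bytes and keep the remainder.
        <<_bom::binary-size(bom_size), rest::binary>> = binary
        {encoding, rest}
    end
  end

  # Strips any BOM, then converts to Latin-1 using the detected encoding.
  # Characters outside Latin-1 still produce an {:error, _, _} tuple.
  def to_latin1(binary, default \\ :utf8) do
    {encoding, rest} = strip_bom(binary, default)
    :unicode.characters_to_binary(rest, encoding, :latin1)
  end
end
```

With the `utf32_with_bom` binary from the tip, `BomTrim.to_latin1(utf32_with_bom)` should return `"foo"`. Detecting the encoding from the BOM, rather than hard-coding `:utf32`, means the same call also handles UTF-16 input with either byte order.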