Working with Encodings in Ruby 1.9
(Last updated on 11.2.2013)
I won’t go too much into the details here, instead covering what I think are the important parts. If you want more information about this topic, I highly recommend The Pragmatic Programmers’ Programming Ruby 1.9, which dedicates a whole chapter to it.
M17n in Ruby 1.9
One of the big changes in Ruby 1.9 is the multilingualization (a.k.a. m17n) support.
Previously, support for anything other than ASCII-encoded strings was, well, somewhat lackluster to say the least. Ruby 1.8 always assumes that the characters in strings are exactly one byte, which invariably leads to problems when working with multibyte encodings, among other things. By setting $KCODE you could add some support for various Japanese encodings or UTF-8, but even then there were some problems that made working with such strings awkward.
Many other languages solve this problem by using one encoding internally for all their strings. Python, for example, uses Unicode for all strings, which means it has to transcode strings in other encodings (e.g. Shift JIS, an encoding used for Japanese characters) before being able to work with them. Ruby 1.9, however, supports per-string character encodings as well as conversion between those encodings. In practice this means you can assign each string its own encoding, and Ruby will automatically do things like calculating the length of a string correctly.
Encodings are represented by an instance of Encoding. Each instance represents a specific encoding, like UTF-8. You can get a list of all built-in encodings by calling Encoding.list. As of Ruby 1.9.1-RC1, this list contains 83 different encodings, which probably covers 99% of the encoding needs you will ever have.
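A quick way to poke at this from a script, sketched with Encoding.list and Encoding.find. The helper name builtin_encoding_names is made up for illustration, and the count is printed rather than assumed because it differs between Ruby versions:

```ruby
# List the names of all encodings this Ruby build knows about.
# The number varies between Ruby versions, so it is printed, not assumed.
def builtin_encoding_names
  Encoding.list.map { |enc| enc.name }
end

puts "#{builtin_encoding_names.size} built-in encodings"

# Encoding.find looks an encoding up by name (case-insensitively).
utf8 = Encoding.find("utf-8")
puts utf8.name                 # UTF-8
puts utf8 == Encoding::UTF_8   # true
```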
Source File Encodings
However, Ruby not only supports encoding for strings (and regular expressions of course), it also supports different encodings for source code files as well as stuff you read from and write to I/O streams. This means you can now use umlauts in your variable names, as long as you properly declare what encoding your source code is in:
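What follows is a reconstruction sketched from the description in the next paragraphs (a magic comment, lambda aliased to a Greek letter, and a two-character string), not the original listing. The names double and umlauts are assumptions chosen for illustration:

```ruby
# encoding: utf-8

# With the encoding declared, non-ASCII identifiers are allowed,
# so Kernel#lambda can be aliased to the Greek letter.
alias λ lambda

double = λ { |x| x * 2 }
puts double.call(21)    # 42

# String literals take on the source file encoding, so length
# counts characters rather than bytes.
umlauts = "äö"
puts umlauts.length     # 2
puts umlauts.bytesize   # 4
```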
In this example, I set the encoding inside the source code file itself by using a magic comment. There are a few different ways of writing these, but basically Ruby scans for a comment on the first or second line (to accommodate shebangs) that contains the string “encoding:” followed by the name of the encoding, so even magic comments like Emacs’ # -*- encoding: utf-8 -*- will work. In addition, Ruby also looks for a UTF-8 Byte Order Mark at the beginning of the file, and automatically sets the encoding accordingly. If none of those is present, Ruby 1.9 will default to US-ASCII.
Once the encoding is set, Ruby also allows non-ASCII identifiers, so I can alias lambda to the actual Greek lambda character. Also notice that it correctly gives the length of the string as 2, whereas Ruby 1.8 would have printed 4. That is because Ruby 1.9 automatically uses the current source file encoding for any string literals within that file.
Note that the above example exhibits what I consider bad behavior on the programmer’s side. Non-ASCII identifiers like Lambda or Pi might be understood by most programmers, but using your native language and characters for variable and method names makes it basically impossible for anyone who doesn’t speak your language to understand your code. So unless you have a very good reason, stick to using English names, just as you used to in Ruby 1.8.
The source file encoding is set for each file individually, so using a library that uses Shift JIS while your source files use UTF-8 is no problem at all (again, as long as everybody properly declares their encoding).
String Encodings
As I already mentioned, each string (and regular expression, which for the purposes of this post are the same thing) has its own encoding, which you can access with the String#encoding method (__ENCODING__ returns the current source file encoding):
>> __ENCODING__.name
=> "UTF-8"
>> str = "Rüby"
=> "Rüby"
>> str.encoding.name
=> "UTF-8"
If you want to transcode the string into a different encoding, use String#encode:
>> str_in_western_iso = str.encode("iso-8859-1")
=> "R######
Since my shell is set up for UTF-8, ISO-8859-1 encoded strings not only won’t show up correctly, they also mess with my output (bastards!). But by looking at the bytes for both strings, we can see that str was properly transcoded:
>> str.bytes.to_a
=> [82, 195, 188, 98, 121]
>> str_in_western_iso.bytes.to_a
=> [82, 252, 98, 121]
Transcoding to an encoding that doesn’t support all characters in your string will of course fail:
>> str_in_ascii = str.encode("us-ascii")
Encoding::UndefinedConversionError: "\xC3\xBC" from UTF-8 to US-ASCII
from (irb):29:in `encode'
from (irb):29
from /usr/local/bin/irb19:12:in `<main>'
However, you can specify placeholder characters for such cases. You can also force an encoding onto a string by using String#force_encoding, which changes its encoding without changing the underlying bytes. Also note that String#encode has many more options for transcoding. Refer to the Ruby docs or Programming Ruby 1.9 for more info.
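To make the placeholder and force_encoding remarks concrete, here is a small sketch using the undef: and replace: options of String#encode. The helper names to_ascii_with_placeholder and relabel_as_latin1 are invented for this example, and "?" is just one possible placeholder:

```ruby
# encoding: utf-8

# Replace characters the target encoding cannot represent
# instead of raising Encoding::UndefinedConversionError.
def to_ascii_with_placeholder(text)
  text.encode("us-ascii", undef: :replace, replace: "?")
end

puts to_ascii_with_placeholder("Rüby")   # R?by

# force_encoding only relabels the bytes; nothing is converted.
# These are the ISO-8859-1 bytes of "Rüby" from the transcript above.
def relabel_as_latin1(bytes)
  bytes.pack("C*").force_encoding("iso-8859-1")
end

latin1 = relabel_as_latin1([82, 252, 98, 121])
puts latin1.encoding.name               # ISO-8859-1
puts latin1.encode("utf-8") == "Rüby"   # true
```

Relabeling is the right tool when the bytes are already correct but carry the wrong encoding tag. When the bytes themselves need to change, encode is what you want.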
External encoding
External data comes in many different encodings as well, and you’ll be glad to hear that Ruby has support for that too.
Just like Strings, IO objects also have encodings associated with them, but unlike Strings, IO objects have the concept of an external encoding. The external encoding describes how the data is actually encoded.
First, a quick example:
>> f = File.open("example.txt")
=> #<File:example.txt>
>> f.external_encoding.name
=> "UTF-8"
>> content = f.read
=> "This file contains only ASCII characters."
>> content.encoding.name
=> "UTF-8"
The external encoding defaults to whatever the $LANG environment variable is set to, in my case de_AT.UTF-8. However, it can also be affected by the $LC_ALL variable (thanks Dan Hensgen).
However, not every file you’re going to read is going to use the default external encoding, so you can override it when calling File.open. Since example.txt only contains ASCII characters, we’ll want to use ASCII encoding when reading it:
>> f = File.open("example.txt", "r:ascii")
=> #<File:example.txt>
>> f.external_encoding.name
=> "US-ASCII"
>> content = f.read
=> "This file contains only ASCII characters."
>> content.encoding.name
=> "US-ASCII"
Here, I told Ruby to use ASCII as the external encoding for example.txt, and any data read from it automatically gets the same encoding.
Internal encoding
Often the external encoding won’t match the encoding we want to use internally, like UTF-8. So when we read a file that contains ISO-8859-1 characters, the strings may have the correct encoding associated with them, but we can’t print them properly, since the shell expects UTF-8 characters:
>> f = File.open("iso-8859-1.txt", "r:iso-8859-1")
=> #<File:iso-8859-1.txt>
>> f.external_encoding.name
=> "ISO-8859-1"
>> content = f.read
=> "This file contains umlauts: ###"
>> content.encoding.name
=> "ISO-8859-1"
We can solve this by specifying our internal encoding when opening the file:
>> f = File.open("iso-8859-1.txt", "r:iso-8859-1:utf-8")
=> #<File:iso-8859-1.txt>
>> f.external_encoding.name
=> "ISO-8859-1"
>> content = f.read
=> "This file contains umlauts: äöü"
>> content.encoding.name
=> "UTF-8"
This causes Ruby to automatically transcode the data when reading, thus yielding proper output.
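If you want to try this without preparing a file by hand, the following self-contained sketch writes three ISO-8859-1 bytes to a temporary file and reads them back. The helper read_latin1_as_utf8 and the temp file name are assumptions made for this example:

```ruby
# encoding: utf-8
require "tempfile"

# Read an ISO-8859-1 file and let Ruby transcode it to UTF-8
# while reading (mode string "r:external:internal").
def read_latin1_as_utf8(path)
  File.open(path, "r:iso-8859-1:utf-8") { |f| f.read }
end

Tempfile.open("latin1-demo") do |tmp|
  tmp.binmode
  tmp.write([228, 246, 252].pack("C*"))   # "äöü" in ISO-8859-1
  tmp.flush

  content = read_latin1_as_utf8(tmp.path)
  puts content.encoding.name   # UTF-8
  puts content == "äöü"        # true
end
```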
The simplest way to set the default internal (and external) encoding for a whole program is via the -E switch, e.g. ruby -E iso-8859-1:utf-8. This would set the default external encoding to ISO-8859-1 and the default internal to UTF-8. You don’t have to specify both: -E iso-8859-1 would only specify the default external encoding, and -E :utf-8 only the default internal encoding.
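Besides the command-line switch, Ruby also exposes these defaults at runtime through Encoding.default_external and Encoding.default_internal, each with a matching setter. The sketch below uses a hypothetical helper, with_default_internal, to change the default only for the duration of a block. Treat it as an illustration, not a recommendation to change defaults mid-program:

```ruby
# encoding: utf-8
require "tempfile"

# Hypothetical helper: temporarily set the default internal encoding
# and restore the previous value afterwards.
def with_default_internal(encoding)
  previous = Encoding.default_internal
  Encoding.default_internal = encoding
  yield
ensure
  Encoding.default_internal = previous
end

puts Encoding.default_external.name   # depends on your locale

Tempfile.open("defaults-demo") do |tmp|
  tmp.binmode
  tmp.write([228, 246, 252].pack("C*"))   # "äöü" in ISO-8859-1
  tmp.flush

  with_default_internal("utf-8") do
    # Only the external encoding is given here; the internal one
    # comes from the default set above.
    content = File.open(tmp.path, "r:iso-8859-1") { |f| f.read }
    puts content.encoding.name   # UTF-8
  end
end
```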
Binary files
All this encoding stuff has one other effect: you can no longer just read binary files like you used to in Ruby 1.8. To force Ruby to read a file as binary data, either specify the b flag or use the binary encoding when opening files (Windows users will already be familiar with the b flag). This sets the external encoding to ASCII-8BIT, ensuring binary data gets read correctly:
>> f = File.open("example.txt", "rb")
=> #<File:example.txt>
>> f.external_encoding.name
=> "ASCII-8BIT"
>> f.read.encoding.name
=> "ASCII-8BIT"
>> f.close
=> nil
>> f = File.open("example.txt", "r:binary")
=> #<File:example.txt>
>> f.external_encoding.name
=> "ASCII-8BIT"
>> f.read.encoding.name
=> "ASCII-8BIT"
Another potential trap is working with strings that have incompatible encodings. If you try to run a regular expression encoded in UTF-8 against an ISO-8859-1 encoded string, you’ll get an exception. There are a few ways to handle this, but the best one is to enforce a single encoding that you use internally (I recommend UTF-8) and transcode any other string to it. This can be achieved in various ways, but setting a proper default internal encoding is probably the easiest.
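A minimal sketch of that trap and of the transcoding fix. The helper incompatible? is made up for this example. It relies on the fact that the exception only occurs when non-ASCII characters are involved, which is why both the pattern and the string contain an umlaut:

```ruby
# encoding: utf-8

# Hypothetical helper: report whether matching raises a compatibility error.
def incompatible?(string, regexp)
  regexp.match(string)
  false
rescue Encoding::CompatibilityError
  true
end

pattern = /ü/                            # UTF-8, from the source encoding
latin1  = "Rüby".encode("iso-8859-1")

puts incompatible?(latin1, pattern)                   # true
puts incompatible?(latin1.encode("utf-8"), pattern)   # false
```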