Hidden cost of string indexing in Ruby

Serhii Potapov December 21, 2020 #ruby

And probably in many other modern languages

Ruby provides the String#[] method which is very convenient to get a char at a given index or to obtain a substring.

But have you ever thought what is the complexity of indexing a char in a String?

If you come from a classical C background you may think of strings like an array of chars. So indexing a string must have the same complexity as indexing an array, which is O(1). Usually, the following formula is used to get a memory address of a particular item in an array:

BASE_ADDR + ITEM_SIZE * INDEX

Where:

BASE_ADDR - memory address, where an array starts
ITEM_SIZE - the size of an array item in bytes
INDEX - index of an element that we are interested in

So it's clearly O(1). But let's do a benchmark. Let's build a very long string and try to index different chars.

require 'benchmark'

N = 100_000

buenos_dias = "Buenos días amigos. "   # 20 chars
text = buenos_dias * 10_000            # 200_000 chars

Benchmark.bm(20) do |x|
  x.report("text[10]: ") do
    N.times { text[10] }
  end

  x.report("text[100_000]: ") do
    N.times { text[100_000] }
  end

  x.report("text[200_000]: ") do
    N.times { text[200_000] }
  end
end

The result may be surprising:

                           user     system      total        real
text[10]:              0.012850   0.000020   0.012870 (  0.012875)
text[100_000]:         1.800117   0.000000   1.800117 (  1.800136)
text[200_000]:         3.587110   0.000000   3.587110 (  3.587150)

Indexing text[200_000] takes 2 times longer than text[100_000]. So the complexity is actually not O(1) but rather O(n).

But why? Some of our reasoning above is wrong. By default Ruby strings are UTF-8 strings and a UTF-8 string is not an array of chars, but rather a sequence of chars. The size of each particular char in UTF-8 is dynamic and it varies from 1 byte (for ASCII) to 4 bytes.

Since char size is not fixed, the formula that is used to calculate a memory address of a particular element in memory is not applicable. The only way that remains for Ruby to get its job done is to iterate over a string char by char until it reaches a char with the desired index. That is linear search with O(n) complexity.

Pure ASCII strings

In the example above we used a Spanish greeting, that has a diacritic í. All the other chars are valid ASCII chars which occupy 1 byte each, but í takes 2 bytes:

"í".bytesize # => 2

For the sake of the experiment let's replace í with i and rerun the benchmark.

buenos_dias = "Buenos dias amigos. "   # 20 chars
text = buenos_dias * 10_000            # 200_000 chars

# rest of the code is the same

The benchmark is executed immediately:

                           user     system      total        real
text[10]:              0.010061   0.000149   0.010210 (  0.010228)
text[100_000]:         0.011963   0.000000   0.011963 (  0.011983)
text[200_000]:         0.010153   0.000000   0.010153 (  0.010157)

Oh. Now we access all the indices more or less in equal time, meaning we get our O(1) complexity back.

This allows us to build the following hypothesis: Ruby tracks string meta-information regarding the presence of non-ASCII characters and depending on that applies different strategies to index a string:

Sequential search if a string contains non-ASCII chars
Memory address calculation if a string contains ASCII chars only

We may go forward and check the hypothesis by taking a look into string.c implementation in the MRI ruby codebase.

We find a comment that describes String flags:

/* FLAGS of RString
 *
 * 1:     RSTRING_NOEMBED
 * 2:     STR_SHARED (== ELTS_SHARED)
 * 2-6:   RSTRING_EMBED_LEN (5 bits == 32)
 * 5:     STR_SHARED_ROOT (RSTRING_NOEMBED==1 && STR_SHARED == 0, there may be
 *                         other strings that rely on this string's buffer)
 * 6:     STR_BORROWED (when RSTRING_NOEMBED==1 && klass==0, unsafe to recycle
 *                      early, specific to rb_str_tmp_frozen_{acquire,release})
 * 7:     STR_TMPLOCK
 * 8-9:   ENC_CODERANGE (2 bits)
 * 10-16: ENCODING (7 bits == 128)
 * 17:    RSTRING_FSTR
 * 18:    STR_NOFREE
 * 19:    STR_FAKESTR
 */

The flag that interests us is ENC_CODERANGE (2 bits). If I got it correctly those 2 bits may represent the following 4 values:

ENC_CODERANGE_7BIT
ENC_CODERANGE_VALID
ENC_CODERANGE_BROKEN
ENC_CODERANGE_UNKNOWN

ENC_CODERANGE_7BIT is exactly the case when all chars are 1 byte each.

Conclusion

I know, the complexity of String#[] is not something that an average ruby developer worries about, because their attention is usually focused on much more pragmatic domain problems.

However, I found it always interesting to know a little bit more about the interpreter internals. And I hope the details we discovered in this article can be beneficial for the performance of your software if you'll once happen to process a big amount of text.

Back to top