Hidden cost of string indexing in Ruby
Serhii Potapov December 21, 2020 #rubyAnd probably in many other modern languages
Ruby provides the String#[]
method which is very convenient to get a char at a given index or to obtain a substring.
But have you ever thought what is the complexity of indexing a char in a String?
If you come from a classical C background you may think of strings like an array of chars.
So indexing a string must have the same complexity as indexing an array, which is O(1)
.
Usually, the following formula is used to get a memory address of a particular item in an array:
BASE_ADDR + ITEM_SIZE * INDEX
Where:
BASE_ADDR
- memory address, where an array startsITEM_SIZE
- the size of an array item in bytesINDEX
- index of an element that we are interested in
So it's clearly O(1)
. But let's do a benchmark.
Let's build a very long string and try to index different chars.
= 100_000
buenos_dias = # 20 chars
text = buenos_dias * 10_000 # 200_000 chars
Benchmark.bm(20) do
x.report() do
N.times { text[10] }
end
x.report() do
N.times { text[100_000] }
end
x.report() do
N.times { text[200_000] }
end
end
The result may be surprising:
user system total real
text[10]: 0.012850 0.000020 0.012870 ( 0.012875)
text[100_000]: 1.800117 0.000000 1.800117 ( 1.800136)
text[200_000]: 3.587110 0.000000 3.587110 ( 3.587150)
Indexing text[200_000]
takes 2 times longer than text[100_000]
.
So the complexity is actually not O(1)
but rather O(n)
.
But why? Some of our reasoning above is wrong. By default Ruby strings are UTF-8 strings and a UTF-8 string is not an array of chars, but rather a sequence of chars. The size of each particular char in UTF-8 is dynamic and it varies from 1 byte (for ASCII) to 4 bytes.
Since char size is not fixed, the formula that is used to calculate a memory address of a particular element in memory
is not applicable. The only way that remains for Ruby to get its job done is to iterate over a string char by char until
it reaches a char with the desired index. That is linear search with O(n)
complexity.
Pure ASCII strings
In the example above we used a Spanish greeting, that has a diacritic í
.
All the other chars are valid ASCII chars which occupy 1 byte each, but í
takes 2 bytes:
.bytesize # => 2
For the sake of the experiment let's replace í
with i
and rerun the benchmark.
buenos_dias = # 20 chars
text = buenos_dias * 10_000 # 200_000 chars
# rest of the code is the same
The benchmark is executed immediately:
user system total real
text[10]: 0.010061 0.000149 0.010210 ( 0.010228)
text[100_000]: 0.011963 0.000000 0.011963 ( 0.011983)
text[200_000]: 0.010153 0.000000 0.010153 ( 0.010157)
Oh. Now we access all the indices more or less in equal time, meaning we get our O(1)
complexity back.
This allows us to build the following hypothesis: Ruby tracks string meta-information regarding the presence of non-ASCII characters and depending on that applies different strategies to index a string:
- Sequential search if a string contains non-ASCII chars
- Memory address calculation if a string contains ASCII chars only
We may go forward and check the hypothesis by taking a look into
string.c
implementation in the MRI ruby codebase.
We find a comment that describes String flags:
/* FLAGS of RString
*
* 1: RSTRING_NOEMBED
* 2: STR_SHARED (== ELTS_SHARED)
* 2-6: RSTRING_EMBED_LEN (5 bits == 32)
* 5: STR_SHARED_ROOT (RSTRING_NOEMBED==1 && STR_SHARED == 0, there may be
* other strings that rely on this string's buffer)
* 6: STR_BORROWED (when RSTRING_NOEMBED==1 && klass==0, unsafe to recycle
* early, specific to rb_str_tmp_frozen_{acquire,release})
* 7: STR_TMPLOCK
* 8-9: ENC_CODERANGE (2 bits)
* 10-16: ENCODING (7 bits == 128)
* 17: RSTRING_FSTR
* 18: STR_NOFREE
* 19: STR_FAKESTR
*/
The flag that interests us is ENC_CODERANGE (2 bits)
. If I got it correctly those 2 bits may represent the following 4 values:
ENC_CODERANGE_7BIT
ENC_CODERANGE_VALID
ENC_CODERANGE_BROKEN
ENC_CODERANGE_UNKNOWN
ENC_CODERANGE_7BIT
is exactly the case when all chars are 1 byte each.
Conclusion
I know, the complexity of String#[]
is not something that an average ruby developer worries about,
because their attention is usually focused on much more pragmatic domain problems.
However, I found it always interesting to know a little bit more about the interpreter internals. And I hope the details we discovered in this article can be beneficial for the performance of your software if you'll once happen to process a big amount of text.