You have probably faced nasty tests in your continuous integration that fail from time to time and turn your build red. They slow down the deployment pipeline and can be very annoying.
In my opinion, intermittent tests can be divided into two major groups: order-dependent tests and tests that are intermittent by themselves. I will cover both in this article.
- Order-dependent tests
- Single intermittent tests
An order-dependent test is a test that always passes in isolation but fails when it runs with other tests in a particular order.
Example: let’s say we have tests A and B. Test A passes in isolation and passes when we run the sequence A, B, but permanently fails in the sequence B, A. Usually this happens because test B does not clean up the environment properly, which in some way affects test A.
Quite often, tests run in a random order in CI. To reproduce a failure, it’s important to know the exact order. Let’s say you know from your CI logs that in the test sequence A, B, C, D, E, F, test E fails. Most likely it fails because one of the preceding tests changes the global environment.
Try to run the sequence A, B, C, D, E locally to confirm the hypothesis. Make sure the tests run exactly in the specified order. If you’re using RSpec, you need the --order defined option:
rspec --order defined ./a_spec.rb ./b_spec.rb ./c_spec.rb ./d_spec.rb ./e_spec.rb
Now how do you know which of A, B, C, D breaks E? You need to experiment, running different sequences like A, E or D, E. If there are a lot of tests this may take long, so I prefer to use binary search. Split the preceding tests into two groups, A, B and C, D, and determine which of the following sequences fails: A, B, E or C, D, E. Then do the same with the failing group until you get a minimal reproducible example, e.g.
rspec --order defined ./b_spec.rb ./e_spec.rb
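The bisection loop can be sketched in plain Ruby. This is a hypothetical illustration (`find_culprit` and the block-based runner are my names, not a real tool, and it assumes exactly one culprit test): it keeps halving the set of preceding specs, retaining whichever half still makes the victim spec fail.

```ruby
# Hypothetical sketch of the bisection idea. The block stands in for
# actually running rspec on a sequence of spec files and returns true
# if the victim spec fails in that sequence.
def find_culprit(preceding, victim)
  while preceding.size > 1
    half = preceding.size / 2
    first, second = preceding[0...half], preceding[half..-1]
    # keep whichever half still reproduces the failure
    preceding = yield(first + [victim]) ? first : second
  end
  preceding.first
end

# Simulated run: the victim fails whenever b_spec precedes it
specs = %w[a_spec b_spec c_spec d_spec]
culprit = find_culprit(specs, 'e_spec') { |seq| seq.include?('b_spec') }
puts culprit  # => b_spec
```

Each iteration halves the search space, so for N preceding tests you need roughly log2(N) runs instead of N.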
UPDATE: As a few of my readers pointed out, rspec --bisect already does this automatically. Thanks!
Now you need to inspect test
B to see where exactly it doesn’t clean up the environment.
Quite often it happens for one of the following reasons:
- The test creates new records in the database without deleting them after.
- The test stubs some object methods (e.g. Time.now) without reverting the change. It often happens with time-manipulation helpers such as Timecop, for example:

  ```ruby
  it 'checks something time-dependent' do
    Timecop.freeze(Time.new(2018, 1, 1))
    # ... assertions ...
    # Timecop.return is never called, so every test that runs
    # after this one still sees the frozen time
  end
  ```
- The test creates files in the file system and does not delete them after.
- The test defines classes that conflict with real classes from the code, which breaks the autoloading mechanism in Rails.
The last point may not be easy to understand, so let me illustrate it with an example.
Let’s say we have a DummyModule module that we want to test, for example something like:

```ruby
module DummyModule
  def description
    "I am #{self.class.name}"
  end
end
```
The test may look like the following:

```ruby
# dummy_module_spec.rb
class DummyService
  include DummyModule
end

RSpec.describe DummyModule do
  it 'builds a description' do
    expect(DummyService.new.description).to eq 'I am DummyService'
  end
end
```
So what’s wrong with it? It creates a new global constant named DummyService. The constant lives on even after the test ends. If you define DummyService in multiple tests, the definitions overlap and may have side effects. Or, if you have a real DummyService class in your Rails app in app/service/dummy_service.rb and you run a test sequence where dummy_module_spec.rb goes before dummy_service_spec.rb, you may get an order-dependent failure: since DummyService is already defined by dummy_module_spec.rb, Rails autoload will never try to load the app/service/dummy_service.rb file, and as a result dummy_service_spec.rb will fail because it tests the wrong version of DummyService (the one defined in dummy_module_spec.rb).
To test such modules, prefer anonymous classes:

```ruby
RSpec.describe DummyModule do
  let(:dummy_class) do
    Class.new do
      include DummyModule

      def self.name
        'DummyService'
      end
    end
  end

  it 'builds a description' do
    expect(dummy_class.new.description).to eq 'I am DummyService'
  end
end
```
Such a test does not pollute the global environment.
Sometimes programming languages and databases have undefined behavior for order-related operations. We developers may make wrong assumptions about it and introduce a bug or an intermittent test. Fortunately, these issues are often easy to spot and fix.
Most databases do not guarantee the order of returned rows unless it is explicitly specified in the query. Always keep this in mind when your test relies on a specific order.
Assume we have an ActiveRecord model User and we want to write a test for a fetch_all_users function, which returns all existing records from the database. It might look like this:

```ruby
def fetch_all_users
  User.all
end

RSpec.describe 'fetch_all_users' do
  it 'fetches all existing records' do
    ["Anthony", "Ahmed", "Paulo", "Max", "Ricardo"].each do |name|
      User.create!(name: name)
    end

    names = fetch_all_users.map(&:name)
    expect(names).to eq ["Anthony", "Ahmed", "Paulo", "Max", "Ricardo"]
  end
end
```
At first glance, this test may look innocent. And it will probably pass if you try to run it.
I had to loop the test and run it about 5000 times to reproduce a single failure (with PostgreSQL 9.5 and RSpec’s use_transactional_fixtures option):

```
1) fetches all existing records
   Failure/Error: expect(names).to eq ["Anthony", "Ahmed", "Paulo", "Max", "Ricardo"]

     expected: ["Anthony", "Ahmed", "Paulo", "Max", "Ricardo"]
          got: ["Max", "Ricardo", "Anthony", "Ahmed", "Paulo"]
```
There are two possible solutions to make the test stable.
The first one is to modify fetch_all_users to enforce the order of returned items:

```ruby
def fetch_all_users
  User.order(:id)
end
```
The second one, if you really don’t care about the order, is to make the test order-agnostic. With RSpec you can use the contain_exactly matcher for that. As the documentation says:
Passes if actual contains all of the expected regardless of order.
So the expectation statement becomes:

```ruby
expect(names).to contain_exactly("Anthony", "Ahmed", "Paulo", "Max", "Ricardo")
```
You should learn the difference between stable and unstable sorting algorithms and know which one is used by default in your programming language and your database.
Let’s take a look at an example with a stable sorting algorithm. Here we have three people, two of whom have the same age, and we sort them by age. A Ruby sketch of the idea:

```ruby
Person = Struct.new(:name, :age)

people = [
  Person.new("Alice", 30),
  Person.new("Bob", 25),
  Person.new("Carol", 25)
]

# A stable sort guarantees that Bob stays before Carol, because that
# is their relative order in the input. (Note: MRI's built-in sort
# makes no such stability guarantee.)
sorted = people.sort_by(&:age)
sorted.map(&:name)  # with a stable sort: ["Bob", "Carol", "Alice"]
```
Stable sorting algorithms retain the relative order of items with equal keys.
As you may conclude, unstable sorting algorithms are those that do not match the definition of a stable sorting algorithm. However, there are two possible types of unstable sorting algorithms:
- Those that always produce the same output for a given input
- Those that may return different output when the same input is given
The second type is not desirable and must be avoided, since it introduces real randomness. An example would be a quicksort implementation with a literally randomly chosen pivot.
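To make this concrete, here is a hypothetical Ruby sketch of such a quicksort (my own illustration, not a real library). The keys always come out sorted, but the relative order of items with equal keys can differ from run to run:

```ruby
Item = Struct.new(:key, :label)

# Quicksort with a literally random pivot: correct on keys, but the
# relative order of items with EQUAL keys depends on pivot choice.
def random_pivot_quicksort(items)
  return items if items.size <= 1

  rest = items.dup
  pivot = rest.delete_at(rand(rest.size)) # random pivot choice
  left, right = rest.partition { |item| item.key < pivot.key }
  random_pivot_quicksort(left) + [pivot] + random_pivot_quicksort(right)
end

items = [Item.new(1, "first"), Item.new(1, "second"), Item.new(2, "third")]
sorted = random_pivot_quicksort(items)
# Keys are always [1, 1, 2], but whether "first" or "second" comes out
# ahead depends on which element happened to be picked as the pivot.
puts sorted.map(&:key).inspect  # => [1, 1, 2]
```

A test that asserts on the exact order of equal-key items under such a sort will fail intermittently.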
Most languages ship the first type of unstable sort, but it’s good to be on the alert.
By the way, if you wonder what kind of sorting algorithm your Ruby version uses, I recommend taking a look at this Stack Overflow answer.
The problem is that some implementations do not guarantee a stable iteration order over hash map keys. For example, that was the case for Ruby before version 1.9, which is why ActiveSupport used to have OrderedHash.
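For reference, modern Ruby no longer has this problem:

```ruby
h = {}
h[:b] = 2
h[:a] = 1

# Since Ruby 1.9, Hash preserves insertion order, so this output is
# deterministic; before 1.9 the iteration order was not guaranteed.
puts h.keys.inspect  # => [:b, :a]
```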
Here is a little Rust program that illustrates the issue, with equivalent Ruby code in the comments (a sketch of the idea):

```rust
use std::collections::HashMap;

fn main() {
    // Ruby: h = {}
    let mut h = HashMap::new();
    // Ruby: h[1] = "one"; h[2] = "two"
    h.insert(1, "one");
    h.insert(2, "two");

    // Ruby: keys = h.keys
    let keys: Vec<i32> = h.keys().cloned().collect();

    // Ruby: raise unless keys == [1, 2]
    assert_eq!(keys, vec![1, 2]);
}
```
If you run this program multiple times, sometimes it may succeed, sometimes it fails:
```
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[2, 1]`,
 right: `[1, 2]`', src/main.rs:16:4
```
The solution is the same as for the previous order-related problems: either update the code to sort the keys explicitly, or make the test order-agnostic:

```rust
use std::collections::HashSet;

// Option 1: sort the keys explicitly before comparing
let mut keys: Vec<i32> = h.keys().cloned().collect();
keys.sort();
assert_eq!(keys, vec![1, 2]);

// Option 2: compare as sets, ignoring order entirely
let key_set: HashSet<i32> = h.keys().cloned().collect();
assert_eq!(key_set, vec![1, 2].into_iter().collect());
```
Everything that does not have 100% defined behavior may lead to similar issues. Here are a few other examples:
- The order of iteration over entries in the file system may vary depending on the file system, operating system, file system drivers, etc.
- Concurrent operations are not guaranteed to finish in the order they start, so you may need to sort the results when aggregating them.
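The file-system point can be shown with a plain-Ruby sketch: a directory listing makes no ordering promise, so sort explicitly before asserting.

```ruby
require 'tmpdir'

Dir.mktmpdir do |dir|
  %w[b.txt a.txt c.txt].each do |name|
    File.write(File.join(dir, name), "")
  end

  # Dir.children makes no ordering promise across file systems and
  # operating systems, so sort explicitly before comparing.
  files = Dir.children(dir).sort
  puts files.inspect  # => ["a.txt", "b.txt", "c.txt"]
end
```

A test that compares an unsorted directory listing against a fixed array may pass on one machine and fail on another.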
Tests should not depend on current time and date.
It’s not obvious, but sometimes a test may fail on CI just because it runs at a specific time in a specific timezone (different from your local one). E.g. it may fail between 20:00 and 00:00 on a CI server running in the Pacific Time Zone, while the failure is not reproducible in Europe.
If you suspect this, the first step is to change your local time settings to reproduce the same time conditions as on the CI server when the test failed. Once you can reproduce the failure locally, it should be relatively easy to debug.
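To see why the timezone matters, note that the very same instant falls on different calendar days in different zones. A minimal Ruby sketch:

```ruby
epoch = Time.at(0)  # 1970-01-01 00:00:00 UTC

# Viewed from UTC+12, the epoch is already January 1st, 1970...
east_day = epoch.getlocal("+12:00").day

# ...but viewed from UTC-12 it is still December 31st, 1969.
west_day = epoch.getlocal("-12:00").day

puts east_day  # => 1
puts west_day  # => 31
# A test asserting "today's" date can therefore pass on a machine in
# one timezone and fail on a CI server in another.
```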
Another example of a test that depends on the current time (a hypothetical sketch; generate_yearly_report stands in for real application code):

```ruby
RSpec.describe 'yearly report' do
  it 'is generated for the current year' do
    report = generate_yearly_report
    expect(report.year).to eq 2018
  end
end
```
Obviously, on the 1st of January 2019 it will start failing. For this test you’d need to stub the current time with Timecop:

```ruby
it 'is generated for the current year' do
  Timecop.freeze(Time.new(2018, 6, 15)) do
    expect(generate_yearly_report.year).to eq 2018
  end
end
```
We have covered the most common ways an intermittent test can sneak into a smooth CI process. However, some situations may be trickier, and tests may fail only when several of the covered factors combine.
Usually it is better to spot such problems at the code review stage; at least by now you know what to pay attention to.
It is also worth saying that this article does not cover problems related to concurrency and asynchronous communication, which are very big topics in themselves.
Thanks for reading, and please give me feedback. What was the toughest intermittent test you had to debug? =)