
Stateful Property Testing in Rust

Introduction

Most developers have written unit tests: you feed your code some inputs and verify the outputs. Property tests automate this process. Instead of manually creating inputs, your test generates them for you and validates the input/output pairs against a set of properties you define. These can be tricky to write since you have to develop good generation strategies and find general properties to test (rather than just hardcoding individual test case results), but they can also help find edge cases you might not have thought to write unit tests for.
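For example, a stateless property test written with the proptest crate might look something like this (a minimal sketch; the property being checked, that reversing a list twice gives back the original list, is just an illustration):

use proptest::prelude::*;

proptest! {
    #[test]
    fn reversing_twice_is_identity(v in prop::collection::vec(any::<i32>(), 0..100)) {
        // proptest generates the input vector for us; we only state the property.
        let mut round_tripped = v.clone();
        round_tripped.reverse();
        round_tripped.reverse();
        prop_assert_eq!(round_tripped, v);
    }
}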

But what if you’re writing property tests for a system with internal state? Testing one input at a time may not be enough anymore – you might need to run whole sequences of steps to test the way the system changes state over time.

This is where stateful property testing comes in. We’ve had great success with this technique at ReadySet and we’re excited to share it with you. To that end, we’ve written and released a general-purpose OSS library, called proptest-stateful, that helps you reap the same benefits for your own projects. But before we talk more about our own experiences, let’s start by explaining the general concepts.

What is stateful property testing?

Stateful property tests generate a sequence of steps to run and then execute them one at a time, checking postconditions for each operation along the way. If a test fails, the test framework will try to remove individual steps in the test sequence to find a minimal failing case. This process is called “shrinking” and is helpful for making test failures easy to interpret. Randomly generated sequences of steps can be very long and complex, so shrinking helps separate out what actually caused the bug.

If you’ve written property tests before, you might be wondering what’s so difficult about this. Can’t you just write code to generate a step, then generate a random sequence of steps and run each one in turn?

This approach falls short for a couple of reasons. First, later steps in the sequence may depend on the internal state created by earlier steps. Second, when shrinking a generated sequence down to a minimal failing case, you need to be careful not to accidentally create test cases that fail for the wrong reasons.

Example Use Case: Counters

Imagine you’re testing a simple library that maintains an unsigned integer counter. There are two API operations: increment and decrement. This is a contrived example, simplified to help illustrate the testing techniques, but in essence, it’s a database.

It’s a very small memory-only database with a very strict schema, but we’re still maintaining a set of data and updating it in response to client requests, just like any other database.

struct Counter { count: usize }

impl Counter {
    fn new(count: usize) -> Self { Counter { count } }
    fn inc(&mut self) { self.count += 1; }
    // Underflows (and panics in debug builds) when `count` is already 0.
    fn dec(&mut self) { self.count -= 1; }
}

If you try to decrement the counter below 0, it will underflow and crash; this is a known limitation of the library. If you test it by generating a random sequence of increment/decrement operations, you will quickly run into this problem. As a result, for each operation it generates, your generation code needs to consider whether it expects the counter value to be zero, and if it does, make sure not to generate a decrement operation.
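For instance, a two-line snippet is enough to hit this limitation (in a debug build, where Rust checks integer arithmetic, this panics; in a release build the counter would silently wrap around instead):

let mut counter = Counter::new(0);
counter.dec(); // panics in debug builds: "attempt to subtract with overflow"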

We can define a simple model to keep track of the expected state like so:

struct TestState { model_count: usize }
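The callbacks below also refer to a CounterOp type representing the two operations. A minimal definition might look like this (the derives are an assumption, added so the variants can be used with proptest strategies like Just):

#[derive(Clone, Debug)]
enum CounterOp {
    Inc,
    Dec,
}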

We can keep the model count updated via a next_state callback that gets run after generating each operation, and then use the model state to decide which operations we can safely generate next:

fn next_state(&mut self, op: &Self::Operation) {
    match op {
        CounterOp::Inc => self.model_count += 1,
        CounterOp::Dec => self.model_count -= 1,
    }
}

fn op_generators(&self) -> Vec<Self::OperationStrategy> {
    if self.model_count > 0 {
        // Decrementing is only safe when the model says the count is nonzero.
        vec![Just(CounterOp::Dec).boxed(), Just(CounterOp::Inc).boxed()]
    } else {
        vec![Just(CounterOp::Inc).boxed()]
    }
}

Additionally, when your test finds a bug and tries to remove steps to find a minimal failing case, the simplified case might trigger underflow even if the original test case did not. If the original case was “start at 0, increment, decrement, increment,” then you can’t remove the first increment unless you also remove the decrement! To avoid this, we must also define preconditions via a separate callback:

fn preconditions_met(&self, op: &Self::Operation) -> bool {
    match op {
        CounterOp::Inc => true,
        CounterOp::Dec => self.model_count > 0,
    }
}

This logic may seem redundant with the code in op_generators, and indeed, any step generated by op_generators that doesn’t pass the precondition check will be filtered out. However, without the logic in op_generators to avoid generating invalid steps in the first place, it can be slow and inefficient to randomly stumble across sequences that pass all the precondition checks. If you’re familiar with property testing in general, the same caveats apply here as with test-case filtering.

The upshot is that the framework reuses the same model at runtime, so after each operation you can also check the real-world counter value and verify that it matches the value your model expects. For a deeper dive into this example, check out the README, or jump straight into a full implementation.
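As a rough sketch of that check (this helper is purely illustrative and not the exact proptest-stateful callback signature), each step could end by comparing the real counter to the model:

// Illustrative only: after running an operation against the real Counter,
// assert that it agrees with the model's expected count.
fn check_counter_matches_model(real: &Counter, model: &TestState) {
    assert_eq!(real.count, model.model_count);
}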

Now let’s take a look at some real-world bugs we’ve found using these techniques, many of which would’ve been nearly impossible to find with more conventional tests.

Bug Spotlight

We have a stateful property test suite that tests replication of data and DDL commands from a primary database into ReadySet’s caching system. Because ReadySet automatically keeps cached query results up-to-date in response to new data and schema changes, it’s critical to make sure that there are no corner cases where the cached data might lose synchronization with the database.

To search for edge cases, this test suite generates sequences of DDL and DML statements and runs them against both Postgres and ReadySet, checking after each step that ReadySet’s cached data matches what the database returns without ReadySet in the mix. For this test, the model state is used to make sure that the DML we generate lines up with the tables we’ve created and their corresponding schemas.
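As a purely hypothetical sketch (this is not the actual test code), that kind of model state might boil down to a map from table names to their current schemas:

use std::collections::HashMap;

// Hypothetical model state for a DDL/DML test suite: next_state would update
// this map as CREATE/ALTER/DROP TABLE steps are generated, and op_generators
// would only emit DML that matches the tables and columns recorded here.
struct SqlModelState {
    tables: HashMap<String, Vec<String>>, // table name -> column names
}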

You can take a look at the source code – it’s already found a plethora of bugs since its inception earlier this year, including some really subtle edge cases that would’ve been nightmares to track down in production. Let’s dive into some examples:

Bug #1: Starting Small

CREATE TABLE t(i int);
ALTER TABLE t RENAME COLUMN i TO j;
CREATE VIEW v AS SELECT * FROM t;

This is a minimal failing case found by our test. It creates a table with a single column, renames the column, then creates a view referencing the table.

We should be able to run queries against the ‘v’ view in ReadySet, but trying to do so resulted in a cryptic error about internal invariant violations. Something about this specific combination of steps was triggering a bug.

The Fix: It turned out that there were certain types of column changes we weren’t replicating due to an overly broad filter clause in one of the internal queries ReadySet performs against PostgreSQL. Luckily, the fix turned out to be a straightforward one-line change.

Creating a view that selected from a table that had previously had a column renamed wasn’t something we’d thought to specifically test with any unit tests or integration suites. With millions of possible edge cases like this, enumerating every case via a unit test is impossible.

While stateful property testing quickly found a failure, the initial test case still had thirteen different steps. Many of the steps were unrelated and obfuscated the root cause of the bug. Shrinking saved us hours of painful debugging - the minimal case had only three steps, making it clear that the problem was with how we were replicating ALTER COLUMN commands.

Okay, simple enough, right? Next up we have this gem:

Bug #2: Ramping Up

CREATE TABLE cats (id INT);
CREATE TABLE dogs (id INT);
INSERT INTO cats VALUES (1);
ALTER TABLE dogs ADD COLUMN bark VARCHAR;
DELETE FROM cats WHERE id = 1;

Here, we created two tables named “cats” and “dogs”. We inserted a value into “cats”, added a column to “dogs”, and then deleted the value we inserted into “cats”.

When these statements were run against ReadySet in quick succession, the row (1) would still show up in the ‘cats’ table forever. The row should have been deleted, but somehow the DELETE was getting silently dropped.

The oddest part? We’re not doing anything unusual to the cats table at all! We just create the table, insert a row, and delete a row.

This was so odd that at first it seemed like there was something wrong with the test. It turned out this bug was easy to reproduce though, and the DELETE was only lost if it happened to coincide with a separate ALTER TABLE statement on a completely unrelated table!

There weren’t any server-side errors that showed up alongside this bug, but because the test suite actively checks table contents as a postcondition of each test step, we were still able to detect that the expected state of the system failed to match up with the actual results. This would’ve been extremely difficult to diagnose in production since this bug left behind no evidence of when or why it was triggered, so generating a minimal set of steps to reproduce this was an incredible win for us.

The Fix: This edge case was being triggered by the ‘dogs’ table being altered before any rows were written to it. Because no data had been replicated to ‘dogs’, ReadySet had not yet created some internal metadata for the table. This missing table metadata fooled ReadySet into thinking it was in the midst of a full re-snapshot of all tables. ReadySet drops outdated replication events while re-snapshotting, as we are already recopying the entire table from scratch, and the deletion fell into that window. These dropped events would normally be redundant, but since ‘cats’ wasn’t actually being re-snapshotted, the events were simply lost, and the cache fell out of sync with the database.

Resolving this bug was trickier than bug #1, but we were still able to develop fixes pretty quickly (check here and here for more!).

Bug #3: Best For Last

In this case we already knew a bug existed, but we didn’t know what caused it. ReadySet was crashing when a specific user tried to run a series of database migrations, but it wasn’t clear why the crash was occurring.

This is actually a pretty common occurrence: your user has a verifiable bug, but finding a minimal repro is the hard part. Fortunately, that’s just what stateful property testing does.

We had just added ENUM support for a user and we did know this bug had something to do with ENUM types, so we added support to the test suite for creating and altering enum types. In short order, this minimal failing case showed right up:

CREATE TYPE et AS ENUM ('a');
CREATE TABLE t (e et);
DROP TABLE t;
ALTER TYPE et ADD VALUE 'b';

First, we create a custom enum type, use it in a table, then drop that table and alter the type.

Playing around with this, it became immediately clear why this was a tricky bug to reproduce. It only occurs if you alter a type after dropping a table that had used that type in its schema. The issue is that ReadySet maintains a list of associated caches for each custom type, and we had simply neglected to update that list when dropping a table. When we then altered the type, ReadySet would try to invalidate a cache that didn’t exist and crash (see here for the fix).

Hard to find by hand, but no sweat for a stateful property test!

Conclusion

I hope you’ve enjoyed this post, and I hope you’re as excited as we are about making property tests stateful.

As you can see, this is a great technique for database systems, but it can be useful for any other kind of stateful application as well. Server software with internal state can be an obvious fit, but there’s no need to limit yourself to backend server systems like this. People have used these techniques for everything from game engines to automotive software, so the sky’s the limit!

We want to share this technique with the world, so we’ve released an open source proptest-stateful crate on crates.io and in our GitHub repo, which anyone can use to write their own stateful property tests in Rust. And, of course, if writing these kinds of tests yourself sounds exciting, don’t hesitate to take a look at our careers page. Until next time!