We are thrilled to welcome Vladimir Rodionov to the ReadySet Engineering team! He has spent over two decades solving difficult data-related problems at companies like Cloudera, Hortonworks, and CarrierIQ. Additionally, he is the founder of CarrotDB and BigBase.org. At ReadySet, he is a member of the Dataflow team, helping us perfect the ReadySet product.
We asked Vlad to share more details about his past work and why he’s excited to be at ReadySet.
Can you give us a brief overview of the data-related problems you’ve worked on previously?
With pleasure! There have been many of them, some work-related, others on the open source side.
I am an Apache committer on the Apache Ratis project, a Java implementation of the Raft consensus protocol. In addition, I have been a long-time contributor to the Apache HBase project, a NoSQL database based on the Google BigTable architecture.
I’ve been working in the Big Data/Hadoop space for more than a decade at several companies: CarrierIQ, SpliceMachine, Hortonworks, and Cloudera.
So far, my most significant HBase contributions are incremental data backup support and distributed medium-sized object compaction. The incremental backup feature allows users to create full and/or incremental data snapshots, store them in cloud storage, and restore data quickly on demand. To the best of my knowledge, this is the only implementation of incremental backup among distributed NoSQL databases so far.
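The full-plus-incremental idea can be sketched in a few lines. This is a hypothetical toy model, not the real HBase implementation (which operates on HFiles and write-ahead logs): each "table" is a dict of row key to a (value, last-modified timestamp) pair, a full backup copies every row, and an incremental backup copies only rows modified since the previous backup.

```python
import copy

# Toy model: table maps row_key -> (value, last_modified_ts).
# Illustrative only; the real HBase feature works at the file level.

def full_backup(table, ts):
    """Snapshot every row, tagged with the backup timestamp."""
    return {"ts": ts, "rows": copy.deepcopy(table)}

def incremental_backup(table, ts, since_ts):
    """Snapshot only rows modified after the previous backup."""
    changed = {k: v for k, v in table.items() if v[1] > since_ts}
    return {"ts": ts, "rows": changed}

def restore(full, incrementals):
    """Replay the full backup, then each incremental in timestamp order."""
    table = dict(full["rows"])
    for inc in sorted(incrementals, key=lambda b: b["ts"]):
        table.update(inc["rows"])
    return table
```

The payoff is that each incremental snapshot is proportional to the data that changed, not to the table size, while a restore still reconstructs the exact current state.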
The distributed MOB (medium-sized object) compaction work began as a fix for a severe data-loss bug affecting one of our customers and resulted in a complete re-engineering of the overall data compaction process. We moved from single-server, single-threaded compaction to a distributed, parallel compaction framework, which drastically improved performance and stability.
I also work on my own open source projects:
- BigBase: a read-optimized version of HBase. BigBase introduced an in-memory row cache (analogous to ScanCache in BigTable) that significantly improved performance on read-heavy workloads. One novel feature of the Row Cache is in-memory data compression, with a choice of compression algorithms. It compresses more efficiently than other data caches because it groups cached objects together and compresses them by group, which works much better than compressing objects separately, especially when the objects are small.
- Velociraptor: a hierarchical caching solution for the distributed OLAP query engine Presto. Velociraptor replaces RaptorX, Presto’s current caching solution, and has several significant advantages over it: it keeps data entirely off-heap rather than on the Java heap, it can cache data on fast SSDs, and it is much more scalable.
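The grouped-compression idea behind BigBase’s Row Cache is easy to demonstrate. In this sketch (illustrative data, standard-library `zlib`), many small, similarly structured objects are compressed one at a time and then as a single group; the group exposes the repeated structure (shared keys, common values) to the compressor, so the combined blob is far smaller than the sum of the individually compressed objects.

```python
import json
import zlib

# 100 small, similarly shaped cache entries (illustrative data).
objects = [
    {"user_id": i, "status": "active", "region": "us-east-1"}
    for i in range(100)
]

# Per-object compression: each tiny payload is compressed in isolation,
# so the compressor never sees the redundancy across objects.
individual_total = sum(
    len(zlib.compress(json.dumps(o).encode())) for o in objects
)

# Group compression: one pass over the concatenated group lets the
# compressor exploit the structure repeated across all entries.
grouped_total = len(zlib.compress(json.dumps(objects).encode()))

print(individual_total, grouped_total)  # the grouped blob is far smaller
```

The trade-off is that reading one entry requires decompressing its group, which is why grouping pays off most when objects are small and access patterns tolerate group-level decompression.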
All my projects can be found on my GitHub page.
What brought you to ReadySet? Coming from someone who has been thinking hard about this problem for years now, what about ReadySet’s approach makes it stand out?
Database caching is important for several reasons:
- Improved performance: Caching reduces the time it takes to retrieve data. Frequently accessed data is kept in cache memory, so subsequent requests for the same data can be served from the cache, which is much faster than reading from disk.
- Reduced database load: By cutting the number of times the database has to go to disk, caching reduces the load on the database, leading to better overall system performance and a more responsive user experience.
- Scalability: By serving frequently accessed data from cache, the database can handle more concurrent users and transactions without requiring additional hardware.
- Cost savings: Caching gets more out of existing hardware, extending its useful life and reducing the need for expensive upgrades and additional software licenses.
However, caching database query results can lead to issues when the underlying data in the database changes, as the cached data may no longer be accurate or relevant. Therefore, we need to invalidate cached data properly.
There are two major strategies for database cache invalidation: time-based invalidation, where cached data is automatically removed after a set amount of time, and event-based invalidation, where cached data is invalidated when a specific event occurs in the database, such as an update or deletion. Event-based invalidation usually happens at the table level, due to the complexity of supporting finer granularity: every update or delete operation on a table invalidates all caches for that table.
As you might expect, both approaches are sub-optimal, for very different reasons. Time-based invalidation provides good performance but no data consistency guarantee: there is a high chance that stale data, sometimes very stale, will be served. Event-based invalidation at the table level, by contrast, provides data consistency but poor performance, because the cache is invalidated on every mutation that hits the table.
With traditional caching approaches, you have to choose between data consistency and performance.
ReadySet solves this dilemma and offers users both: consistency (eventual, but with a very short convergence time) and performance. Its invalidation is event-based but operates at a very fine granularity. Data is invalidated only when it MUST be invalidated: cached query results are automatically updated only when they actually become stale, at the exact moment the impacting writes occur.
ReadySet’s caching technology is based on a solid mathematical foundation and, in theory, can be applied not only to SQL but to other data query languages as well.
I became interested in working at ReadySet because I saw the opportunity to apply my knowledge of data caching directly. I’m excited to be part of a team that is making an impact in a space I am deeply passionate about!