Data Observability with Monte Carlo

Having worked on data engineering teams at a number of software companies, I’ve had one truth confirmed to me time and time again: data pipelines break. Worse, the multiple hops and data paths in modern pipelines make it difficult to gain system-level visibility. Debugging broken pipelines without data visibility is like being a plumber who has to resolve sewage problems without a single access point to a house’s plumbing system.

Monte Carlo, founded in 2018 and funded by top venture capital firms like Accel, GGV, and Redpoint, builds a data observability solution giving companies this needed visibility. Using Monte Carlo, businesses can react quickly to data pipeline issues, limiting data downtime.

How It Works

Take as an example an arbitrary data pipeline. At the top of the stream, we have a transactional storage layer for application metadata. Serverless compute nodes extract data from this top layer, and the data is eventually stored in analytics engines. A real-world pipeline would likely have additional data sources and other intermediate stops, but I’m using the following diagram for simplicity’s sake:

Let’s say a field called customer_email in one of your table schemas is present in 95% of records, so it's missing in only 5% of records. This percentage is the same across the entire pipeline, as it probably should be since the point of the pipeline is data transformation and aggregation, not field-level mutations.
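As a toy illustration of the kind of metric in play here, the missing rate of a field can be computed at any stage of the pipeline. The records and field names below are hypothetical, not Monte Carlo’s API:

```python
# Hypothetical records from one stage of the pipeline.
records = [
    {"customer_id": 1, "customer_email": "a@example.com"},
    {"customer_id": 2, "customer_email": None},
    {"customer_id": 3, "customer_email": "c@example.com"},
    {"customer_id": 4, "customer_email": "d@example.com"},
]

def missing_rate(records, field):
    """Fraction of records where `field` is absent or null."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

print(missing_rate(records, "customer_email"))  # 0.25
```

Tracking a metric like this at each hop is what makes a divergence between stages detectable in the first place.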

However, one day, queries running in the analytics services start showing that customer_email is missing 25% of the time. This inconsistency may go unnoticed by any human eye observing the downstream data, but Monte Carlo would notice immediately. With Monte Carlo, your data pipeline looks more like this:

Monte Carlo’s data collection layer integrates with each data store in your pipeline, extracting metadata describing qualities of the data in each location. In the example of the customer_email field, let's say a bug between the transactional storage layer and the data lake starts deleting the field in some records.

Since Monte Carlo’s data monitoring system has a wide-lens view of your pipeline over time, its algorithms would notice that the customer_email field is missing more often than before. Monte Carlo could notify you of this discrepancy, helping you resolve issues more quickly. Without such a system, data inconsistency problems can go unnoticed for long periods of time, and are often only noticed once the underlying data is needed for critical decision-making.
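To make the idea concrete, here is a toy sketch of a baseline check, not Monte Carlo’s actual algorithm: flag today’s missing rate if it deviates from the historical mean by more than a few standard deviations. The history values are hypothetical:

```python
import statistics

# Hypothetical daily missing rates for customer_email over two weeks.
history = [0.05, 0.048, 0.052, 0.05, 0.051, 0.049, 0.05,
           0.05, 0.047, 0.053, 0.05, 0.052, 0.05, 0.049]

def is_anomalous(history, today, n_sigmas=3.0):
    """Flag today's rate if it sits more than n_sigmas standard
    deviations away from the historical mean."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)
    # Guard against a zero-variance history.
    return abs(today - mu) > n_sigmas * max(sigma, 1e-9)

print(is_anomalous(history, 0.25))   # True  (25% missing is way off baseline)
print(is_anomalous(history, 0.051))  # False (within normal variation)
```

A production system would learn these thresholds per field and per table rather than hard-coding them, but the shape of the check is the same: compare live metadata against its own history.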

The Trend of Operational Data

In 2006, British mathematician Clive Humby coined the phrase “data is the new oil”, which quickly became a common saying in the corporate world. A more recent sub-trend is that analysts, engineers, CEOs, and others are demanding data with lower latency, so they can respond to new information more quickly and stay ahead of the competition.

Companies previously used data for more macro purposes than they do today. For example, a retail business might collect quarterly stats about shopper demographics, consolidate reports to be presented at executive meetings, and then use them to build the company’s marketing plan for the next six months.

Long-cycle, offline data “pipelines” such as this are in many cases being replaced by shorter-cycle, online pipelines with tighter feedback loops. In a past job, I worked on a pipeline that looked like this:

The ML engineers our team supported built models that were trained on a few months of data, but that were used on data no more than a few hours old. Any upstream problem in our pipeline would affect the day-to-day work of ten ML engineers in our department. Monte Carlo would describe this problem as data downtime.

Back when businesses only used data for offline reporting, a data collection or data structuring problem could be resolved offline, maybe delaying a report placed on your boss’s desk by a few hours or even a few days. In today’s world, these problems more directly result in people not being able to do their jobs.

The Trend Against ETL

I think that complex ETL processes are here to stay for the foreseeable future, but with that said, I’m intrigued by a recent trend which could reduce the need for individual companies to maintain their own pipelines and data engineering teams.

Tools like Dremio and Presto allow analysts to query data where it resides, rather than build brittle solutions to move data to an analyst-friendly environment. Other services like Stitch and Fivetran don’t directly decrease how much your data needs to move from place to place, but reduce the likelihood of mistakes happening along the way by providing flexible connectors between all kinds of data sources.

These types of tools allow companies to focus more on their own products and unlock the power of their data without having to babysit it. However, for those who need the flexibility that in-house data pipelines provide and are willing to tolerate the inevitable babysitting that comes with them, I think Monte Carlo is a powerful solution that will grow quickly in the space.

Software engineer @ Rockset. I love writing about new SaaS products + trends in the data and infrastructure categories.