Data Quality Monitoring Tools

Data quality monitoring tools are essential for ensuring the accuracy and reliability of your data. With so many options on the market, it can be challenging to know which one to choose. In this post, we will compare five popular but different data quality monitoring tools: Soda, Great Expectations (GE), Re_data, Monte Carlo, and LightUp.

Open Source

Soda, GE, and Re_data are all open-source tools, while Monte Carlo and LightUp are not.

All the tools are based on Python, except for Monte Carlo, which doesn't specify its base.


Data Sources - In Memory

Soda uses Spark for in-memory data sources, while GE uses pandas and Spark. Re_data doesn't specify, and Monte Carlo and LightUp don't support in-memory data sources.


Data Sources - Database/Lake

Soda supports Athena, Redshift, BigQuery, PostgreSQL, and Snowflake, while GE supports Athena, BigQuery, MSSQL, MySQL, PostgreSQL, Redshift, Snowflake, SQLite, and Trino. Re_data supports dbt, and Monte Carlo supports Snowflake, Redshift, BigQuery, Athena, and Databricks, with BI integrations for Looker, Tableau, and others. LightUp supports Athena, BigQuery, Databricks, PostgreSQL, Oracle, and others.


Streaming

Only GE supports mini-batch streaming.
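To make "mini-batch" concrete, here is a minimal sketch in plain Python of the general idea: records arriving as a stream are grouped into small batches, and each batch is validated on its own. This is not GE's API; the helper names (mini_batches, validate_batch) and the 'amount' field are assumptions made up for illustration.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator


def mini_batches(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` records from a stream."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def validate_batch(batch: list[dict]) -> dict:
    """Hypothetical check: every record must have a non-null 'amount'."""
    failed = [record for record in batch if record.get("amount") is None]
    return {"checked": len(batch), "failed": len(failed), "success": not failed}


# Illustrative stream of five records, validated two at a time.
stream = [
    {"amount": 10},
    {"amount": None},
    {"amount": 7},
    {"amount": 3},
    {"amount": 5},
]
results = [validate_batch(batch) for batch in mini_batches(stream, size=2)]
```

Each entry in results describes one batch, so a failing batch can be flagged without waiting for the rest of the stream.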


Interact From Code

All tools except Monte Carlo can be interacted with from code, each with a different approach.


Managed By Code

All tools except LightUp can be managed by code, each with a different approach.


Dashboard

Soda is cloud-based and doesn't require configuration, while GE can be self-hosted, and its report can be served from blob storage. Re_data can be cloud-based (paid) or self-hosted. Monte Carlo is cloud-based, and LightUp can be hosted on the UI.

In terms of visualization, Soda has a time-series graph of every result and health percentage, while GE doesn't have any visual graphs. Re_data has a time-series graph per lineage, Monte Carlo has different visualizations for test results and a catalog, and LightUp shows data delay.

Soda shows failed rows and makes it easy to see other tables, while GE shows failed values but not rows. Re_data has compiled SQL tests that are quite similar to Soda's, and Monte Carlo shows a lot of information, perhaps too much. LightUp doesn't show failed rows.
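The difference between "failed rows" and "failed values" matters when debugging. The sketch below, in plain Python, shows the same check reported both ways. It is not taken from any of the tools; the sample data and the deliberately naive is_valid_email rule are assumptions for illustration only.

```python
# Illustrative dataset: each dict is one row of a hypothetical users table.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": "b@example.com"},
    {"id": 4, "email": ""},
]


def is_valid_email(value: str) -> bool:
    """Deliberately naive rule for illustration: the value must contain '@'."""
    return "@" in value


# "Failed rows": the full records, so the offending row can be located by id.
failed_rows = [row for row in rows if not is_valid_email(row["email"])]

# "Failed values": only the offending column values, without row context.
failed_values = [row["email"] for row in failed_rows]
```

With failed rows you can go straight to the record with id 2 or 4; with failed values alone you only know which values were rejected.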

For dataset management, Soda allows assigning a dataset owner, but setting more detailed permissions isn't possible. GE doesn't allow any dataset management. Re_data supports adding an owner, while Monte Carlo and LightUp allow dataset management from the UI.


Tests

Soda and Re_data use SodaCL and SQL, respectively, while GE uses Python (with the option of using SQLAlchemy for custom expectations) and SQL rules. Monte Carlo uses SQL rules, and LightUp uses SQL.
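As a rough illustration of what a SQL-based test looks like in general, the sketch below runs a "failing rows" query against an in-memory SQLite database using Python's standard sqlite3 module. The orders table, the rule itself, and the run_sql_rule helper are hypothetical and do not reflect the syntax of any specific tool above.

```python
import sqlite3

# In-memory database with a small, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (2, -5.0), (3, None), (4, 12.5)],
)


def run_sql_rule(connection: sqlite3.Connection, failing_rows_query: str) -> dict:
    """Run a query that selects violating rows; the rule passes if none come back."""
    failing = connection.execute(failing_rows_query).fetchall()
    return {"passed": len(failing) == 0, "failing_rows": failing}


# Hypothetical rule: amount must be present and non-negative.
result = run_sql_rule(
    conn,
    "SELECT id, amount FROM orders "
    "WHERE amount IS NULL OR amount < 0 ORDER BY id",
)
```

Writing the rule as a query that returns the violating rows, rather than just a count, keeps the evidence available for whoever investigates the failure.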

Another important aspect of data quality monitoring tools is how easy they are to use for non-engineers. In this regard, Soda stands out with its user-friendly UI that allows creating monitors with ease. LightUp also offers a simple UI for creating tests, while Monte Carlo and Re_data fall in the medium difficulty range. GE, on the other hand, requires a good level of technical expertise to use.

Historic metrics are another crucial factor for data quality monitoring tools. Soda, LightUp, and Monte Carlo all offer the capability to track historic metrics over time, which is essential for trend analysis and detecting data quality issues that may arise over time. Re_data allows users to save results, but does not provide an automated historic metrics tracking feature. Unfortunately, GE does not store scan results, making it difficult to compare results over time.
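To show why stored history matters, here is a minimal sketch in plain Python: each scan's row count is recorded, and the latest value is compared against the average of earlier scans. The function names, the sample dates and counts, and the 50% tolerance are all assumptions for illustration; the tools above use their own, more sophisticated methods.

```python
from __future__ import annotations

from statistics import mean

# In a real setup this would be a database table; a list keeps the sketch simple.
history: list[dict] = []


def record_scan(scan_date: str, row_count: int) -> None:
    """Append one scan result to the stored history."""
    history.append({"scan_date": scan_date, "row_count": row_count})


def is_anomalous(latest: int, previous: list[int], tolerance: float = 0.5) -> bool:
    """Flag `latest` if it deviates from the mean of `previous` by more than
    `tolerance` (as a fraction of that mean). With no history, nothing is flagged."""
    if not previous:
        return False
    baseline = mean(previous)
    return abs(latest - baseline) > tolerance * baseline


# Made-up daily row counts; the last day drops sharply.
for date, count in [
    ("2023-01-01", 1000),
    ("2023-01-02", 1040),
    ("2023-01-03", 980),
    ("2023-01-04", 300),
]:
    record_scan(date, count)

counts = [scan["row_count"] for scan in history]
alert = is_anomalous(counts[-1], counts[:-1])
```

Without the earlier scans there is no baseline, which is why a tool that does not store results makes this kind of comparison hard.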


Notification

Soda and GE support email and Slack notifications, while Re_data only supports email. Monte Carlo supports Slack, Teams, webhooks, and more, and LightUp supports Slack, Teams, and others.


Incident management

Soda, Monte Carlo, and LightUp all offer incident management features, making it easier to track and resolve issues as they arise. Re_data and GE do not.


Pricing

Finally, pricing is an important consideration when selecting a data quality monitoring tool. Soda is open source, so you can use it for free, but if you use Soda Cloud, the price depends on the license and the number of users. For a small team of 10 users or fewer, it is fairly affordable, while larger licenses can cost a few times more. Monte Carlo is more expensive than Soda, as it provides fully managed services. LightUp does not disclose its pricing publicly, but it is assumed to be similar to or slightly less than Monte Carlo's. GE is an open-source tool and therefore free to use.


In conclusion, each data quality monitoring tool has its own strengths and weaknesses, and the best tool for your needs will depend on your specific use case and requirements. Soda and Monte Carlo are both strong options for incident management and historic metrics tracking, while LightUp offers excellent BI tool integration. Re_data is a solid choice for users who rely heavily on dbt, while GE is a good option for those on a tight budget and with the technical expertise to fully utilize an open-source tool.
