
Data Stack Modernization

Date: Jan 10th, 2022

Author: Amit Jain

Reading time: 8 min

How to rapidly scale and simplify a traditional data stack. This post covers Fullscript's journey toward modernizing our data stack and the thought process behind it.

This is the first post in a series, written as we learn and move through the various stages of data stack modernization.

A data stack refers to the core tooling data engineers use to collect data from various application sources and make it available in a format ready for analytics.

Here we outline a data stack, its traditional components, and their limitations. We then discuss how to overcome those limitations and fill in the missing components using the Modern Data Stack (MDS) approach, along with a few other pieces of the puzzle.

This post will benefit anyone looking for an end-to-end solution built on emerging trends: one that covers most of the business applications expected of a data system or data warehouse, can be implemented quickly with minimal resources, relies on out-of-the-box SaaS products, and keeps security in view.

Traditional Data Stack Limitations

The following is a typical representation of a data stack, whether an initial implementation at a startup or a legacy implementation in a mature organization.

[Image: a traditional data stack]

A traditional data stack has several shortcomings that make it hard to scale and manage with limited resources:

  • Over-dependence on data engineers (traditionally called ETL developers) for everything from ingestion, transformation, and modeling to making data available for reporting.
  • Ingesting data from each new source requires understanding the source's data model and schemas, costing time in knowledge transfer to the data engineers and in custom development for each data source (see the sketch after this list).
  • A waterfall SDLC approach and long delivery lead times.
  • Data lineage is hard to trace because it requires combing through code, which makes it difficult to justify the KPIs presented in reports and to analyze the impact of changes to source-system data models.
  • Data extracts flowing from the warehouse back into the applications are typically frowned upon by data teams because of the manual maintenance effort and limited knowledge of each application's database schema requirements.
  • Data warehouse storage and compute are tied together, making it impossible to scale elastically with demand.
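
To make the custom-development burden concrete, the sketch below shows the kind of hand-written incremental load a data engineer ends up maintaining for every table of every source (all table and column names are hypothetical). Each new source means another script like this, written only after someone has learned that source's schema, and broken whenever that schema changes.

```sql
-- Hand-rolled incremental load: one of these per source table.
-- Hypothetical names; every column mapping below encodes hard-won
-- knowledge of the source schema and breaks when that schema changes.
MERGE INTO analytics.dim_customer AS tgt
USING (
    SELECT
        id         AS customer_id,
        email_addr AS email,        -- source quirk: 'email_addr', not 'email'
        updated_at
    FROM raw.crm_customers
    WHERE updated_at > (SELECT COALESCE(MAX(updated_at), '1970-01-01')
                        FROM analytics.dim_customer)
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    email      = src.email,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, email, updated_at)
    VALUES (src.customer_id, src.email, src.updated_at);
```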

Modernized Data Stack

The following components are introduced in the modernized data stack:

[Image: the modernized data stack]
  • Data Loader (Extract Load → EL): Replace traditional Extract Transform Load (ETL) tooling with a SaaS-based data loader that ships with prebuilt connectors and out-of-the-box knowledge of application schemas. Fivetran and Stitch are strong commercial players in this field, and Airbyte is an up-and-coming option with both open-source and licensed versions. This makes data ingestion from new sources extremely simple, cutting development effort from months of custom work to minutes of configuration.
  • Cloud Data Warehouse: Storage and compute are separated, allowing you to scale elastically with demand. Zero-copy cloning lets you spin up multiple environments as needed to develop, test, and deploy changes (see the warehouse sketch after this list). Great examples in this category include Snowflake and Databricks.
  • Data Build Tool: This lets you build complex data models using just SQL, enabling non-engineers such as reporting and data analysts to work with the tooling. It enables the practice of analytics engineering, with built-in testing for data quality and an online, searchable data catalog with lineage. There are reusable macros and an overall shockingly low learning curve. The company behind the tool is aptly named dbt Labs; there is an open-source version called dbt Core and a SaaS product named dbt Cloud (a model sketch follows this list).
  • Modern Data Catalog: Also known as Data Catalog 3.0, this includes the business and technical metadata and data lineage capabilities of prior generations, and adds automation and collaboration features that help teams share knowledge about data. Well-known commercial players in this area include Atlan and Secoda, and LinkedIn's DataHub is an open-source option.
  • Reverse ETL: The idea is to master customer records in the data warehouse from multiple source apps and then send the data back into those apps. This completes the missing piece of automated data integration by copying data back into the systems of record. Since the commercial tools in this space already know the common app schemas and handle the mapping setup, integrating data back into the source systems is simple (a sketch of a mastered customer model follows this list). The one caveat is data latency; for real-time integration, you can opt for a customer data platform (CDP) instead of reverse ETL. Leading players in the reverse ETL field are Hightouch and Census.
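
To make the cloud-warehouse point concrete, here is a minimal sketch in Snowflake SQL (the database and warehouse names are hypothetical): a zero-copy clone stands up a full development environment in seconds without duplicating storage, and compute is resized independently of the data.

```sql
-- Spin up a dev environment as a zero-copy clone of production;
-- no data is physically copied, and storage is shared until rows diverge.
CREATE DATABASE analytics_dev CLONE analytics_prod;

-- Scale compute independently of storage: size the virtual warehouse
-- up for a heavy backfill, then back down when demand drops.
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XLARGE';
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XSMALL';
```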
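The dbt workflow described above boils down to SQL SELECT statements plus the ref() macro, which is how dbt resolves dependencies between models and derives its lineage graph. A minimal sketch of a model (the upstream staging models stg_customers and stg_orders are hypothetical):

```sql
-- models/customer_orders.sql: dbt materializes this SELECT as a table
-- or view; ref() resolves upstream models and records the lineage.
SELECT
    c.customer_id,
    c.email,
    COUNT(o.order_id) AS order_count,
    MAX(o.ordered_at) AS last_ordered_at
FROM {{ ref('stg_customers') }} AS c
LEFT JOIN {{ ref('stg_orders') }} AS o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.email
```

Data quality tests (for example, not_null or unique on customer_id) are declared in a YAML file alongside the model and run with dbt test.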
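For the reverse ETL piece, the data team's job is to produce one mastered row per customer that the sync tool pushes back to the apps. A minimal sketch, deduplicating across two hypothetical source tables by most-recent update (QUALIFY is Snowflake syntax):

```sql
-- One mastered row per customer across app sources; the reverse ETL
-- tool syncs this table back into the source applications.
WITH all_sources AS (
    SELECT customer_id, email, updated_at FROM crm_customers
    UNION ALL
    SELECT customer_id, email, updated_at FROM billing_customers
)
SELECT customer_id, email, updated_at
FROM all_sources
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id ORDER BY updated_at DESC
) = 1;
```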

Conclusion

We’ve had a great start in our journey to replace our traditional legacy data platform, using the Modern Data Stack (MDS) as an initial reference and augmenting it with reverse ETL and a modern data catalog for a well-rounded solution. There is a great ecosystem and real harmony among the vendors in this space, so we did not run into incompatibilities between tools, with one exception: the data catalog, where we had to choose carefully to ensure connectivity with our legacy platform as well as the modern stack to support the migration.

We also accounted for the fact that we would need to run the traditional and modernized data stacks in parallel for the duration of the migration. Another consideration is a careful review of the terms and conditions, and of any special agreements needed with the SaaS vendors, to protect and meet your specific business compliance requirements. The next part of this series will cover our learnings from implementing the modern data stack and from the migration journey.
