Overview
What is an Automated Data Pipeline?
An automated data pipeline (ADP) is an end-to-end solution for automating the ingestion, transformation, storage, and presentation of data on a scalable platform. This BoosterPack demonstrates how a series of open-source tools can be integrated to create an ADP. In the Sample Solution, two types of data are used to showcase this capability:
- stock market data, and
- news data.
The main tools used in this BoosterPack are Apache Airflow, Apache Kafka, and MySQL. The entire solution is deployed on a one-node Kubernetes cluster, which is created using the same technique published in the “Automate Cloud Orchestration with Kubernetes” DAIR BoosterPack.
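To make the four stages concrete, the following is a minimal sketch of an ingest, transform, store, and present flow. It is not the Sample Solution's code: it assumes Python's standard-library `queue.Queue` as a stand-in for a Kafka topic and an in-memory SQLite database as a stand-in for MySQL, and the record fields, table name, and function names are all hypothetical.

```python
import queue
import sqlite3

# Hypothetical sample records; the real solution ingests stock and news data.
RAW_TICKS = [
    {"symbol": "abc", "price": "10.50"},
    {"symbol": "xyz", "price": "20.25"},
    {"symbol": "abc", "price": "11.50"},
]


def ingest(records, topic):
    """Ingestion: push raw records onto a queue (stand-in for a Kafka topic)."""
    for record in records:
        topic.put(record)


def transform(topic):
    """Transformation: drain the queue, upper-casing symbols and parsing prices."""
    rows = []
    while not topic.empty():
        record = topic.get()
        rows.append((record["symbol"].upper(), float(record["price"])))
    return rows


def store(conn, rows):
    """Storage: persist cleaned rows (SQLite here as a stand-in for MySQL)."""
    conn.execute("CREATE TABLE IF NOT EXISTS ticks (symbol TEXT, price REAL)")
    conn.executemany("INSERT INTO ticks VALUES (?, ?)", rows)
    conn.commit()


def present(conn):
    """Presentation: return the average price per symbol, sorted by symbol."""
    cursor = conn.execute(
        "SELECT symbol, AVG(price) FROM ticks GROUP BY symbol ORDER BY symbol"
    )
    return cursor.fetchall()


def run_pipeline(records):
    """Run the four stages in order against throwaway in-memory resources."""
    topic = queue.Queue()
    conn = sqlite3.connect(":memory:")
    try:
        ingest(records, topic)
        store(conn, transform(topic))
        return present(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for symbol, avg_price in run_pipeline(RAW_TICKS):
        print(f"{symbol}: {avg_price:.2f}")
```

In a deployment along the lines described here, a scheduler such as Apache Airflow would typically trigger each stage as a separate task rather than calling them in sequence from one function; the sketch collapses that orchestration to keep the data flow visible.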
What value does it add to my business?
This solution:
- automates the processes of ingesting, processing, storing, and presenting data.
- has zero licensing cost as all the tools used in this BoosterPack are open source.
- is cloud agnostic and can be deployed on various cloud platforms.
- implements a microservices-based architecture where the entire application is a collection of loosely coupled services. The individual services are independently deployable, highly maintainable, and testable.
- can be adapted for various business cases.
The advantage of this BoosterPack is that it enables you to select and integrate a range of open-source tools into a reliable, scalable, extendable end-to-end solution for real-time (or near real-time) data management.
Why choose an Automated Data Pipeline over the alternatives?
Traditionally, organizations have had to hire architects (data or enterprise) to design the architecture and then have a development team build the required services. Built from the ground up today, an equivalent ADP framework would take several months and cost thousands of dollars to develop.
This BoosterPack Sample Solution follows a generalized approach that suits most common data management projects. Organizations can get started quickly by using the Sample Solution as a foundation and then customizing it for their specific business use case.
The tools used in this solution can be downloaded for free, and most have been used in previous BoosterPack Sample Solutions. What this BoosterPack contributes is their selection and integration into a reliable, scalable, and extendable end-to-end solution for real-time (or near real-time) data management.