ETL scheduling: One of the undisputed requirements is the ability to schedule different ETLs with unique characteristics. While most teams require their ETL jobs to run daily, some jobs need to run on an hourly, weekly, or monthly basis. Teams need the flexibility to specify not only different scheduling intervals but also different start/end times and retry behaviors for their specific ETLs.
Task dependencies: Teams also need to specify dependencies between different ETL jobs. These can be dependencies between different jobs owned by a single team, but can also be extended to include dependencies on jobs owned by other teams, i.e. cross-team dependencies. An example of this is when the Business Intelligence team wants to reuse a table created by the Authentication team to build summary tables that eventually power their dashboards.
Undoing and backfilling: Every team in Adyen strives to productionize their tables fast and iterate on them. This usually means that teams need to rerun some of their ETLs multiple times. Sometimes, data might be corrupted or incomplete for certain date ranges. This inevitably requires rerunning ETL pipelines for specified date ranges on certain tables, while also taking into account downstream dependencies and (possibly differing) schedule intervals; a minimal sketch covering these requirements follows this list.
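To make these requirements concrete, here is a minimal sketch of what such a DAG could look like. The DAG, task, and team names (bi_summary_tables, auth_sessions, and so on) are hypothetical, and the imports assume Airflow 2.x; this is an illustration rather than anything Adyen runs in production.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

default_args = {
    "owner": "business-intelligence",
    "retries": 3,                          # per-ETL retry behavior
    "retry_delay": timedelta(minutes=15),
}

with DAG(
    dag_id="bi_summary_tables",
    schedule_interval="@daily",            # could equally be hourly, weekly, or monthly
    start_date=datetime(2023, 1, 1),       # per-ETL starting time
    catchup=True,                          # lets Airflow backfill past intervals
    default_args=default_args,
) as dag:

    # Cross-team dependency: wait for the Authentication team's job to finish first
    wait_for_auth = ExternalTaskSensor(
        task_id="wait_for_auth_sessions",
        external_dag_id="auth_sessions",
        external_task_id="build_sessions_table",
        mode="reschedule",                 # free the worker slot while waiting
        timeout=6 * 60 * 60,
    )

    build_summary = BashOperator(
        task_id="build_summary_table",
        bash_command="echo 'run the summary-table ETL here'",  # placeholder step
    )

    wait_for_auth >> build_summary

# Rerunning a corrupted date range later, downstream tasks included:
#   airflow dags backfill bi_summary_tables -s 2023-01-01 -e 2023-01-07
```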
Scalability. Its design allows it to scale with minimal effort from the infrastructure team.
Extensible model. It is extremely easy to add custom functionality to Airflow to fulfill specific needs (a minimal operator sketch follows this list).
Built-in retry policy, sensors, and backfilling. With these features, we can add a DAG or task and run its ETL retroactively, or safely wait for an external event before a DAG kicks off.
Monitoring and management interface.
Built-in interface to interact with logs.
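As an illustration of that extensibility, a custom operator is little more than a subclass of BaseOperator. The sketch below is hypothetical (the class name and "publish a report" logic are made up) and assumes Airflow 2.x:

```python
from airflow.models.baseoperator import BaseOperator


class PublishReportOperator(BaseOperator):
    """Hypothetical operator that pushes a finished table to a reporting tool."""

    def __init__(self, table_name: str, target_url: str, **kwargs):
        super().__init__(**kwargs)
        self.table_name = table_name
        self.target_url = target_url

    def execute(self, context):
        # Airflow calls execute() on a worker when the task instance runs;
        # retries, logging, and state tracking come for free from the framework.
        self.log.info("Publishing %s to %s", self.table_name, self.target_url)
        # ... the actual publishing logic would live here ...
```

Once such an operator is importable from the DAG files, it can be used like any built-in operator, which keeps team-specific logic out of the core installation.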
Airflow web server: The main component responsible for the UI that all our stakeholders interact with. However, downtime of the web server does not automatically mean that ETLs stop running; scheduling and execution are handled by the scheduler and workers.
Airflow scheduler: The brains of Airflow. It is responsible for DAG serialization, defining the DAG execution plan, and communicating with the Airflow workers.
Airflow worker: The workhorse of the installation; it picks up tasks from the scheduler and executes them. Workers let us scale out horizontally by simply adding more of them, and different worker types can run with different configurations. At Adyen, we make use of Celery workers.
Broker queue: Responsible for keeping track of tasks that were scheduled and still need to be executed. The technology of your choice here should be reliable and scalable. At Adyen, we use Redis.
Relational database: Stores the metadata needed for DAGs and Airflow to run, as well as the results of task executions. At Adyen, we make use of a Postgres database.
Flower: An optional component for monitoring and understanding what is happening with the Celery workers and the tasks they are executing.
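Wiring these components together mostly happens in airflow.cfg (or the matching environment variables). The snippet below is a sketch with placeholder hosts and credentials, not Adyen's actual configuration; depending on the Airflow version, some keys may live in slightly different sections.

```ini
[core]
executor = CeleryExecutor
# In recent Airflow versions this setting lives under [database] instead of [core].
sql_alchemy_conn = postgresql+psycopg2://airflow:secret@postgres-host:5432/airflow

[celery]
broker_url = redis://redis-host:6379/0
result_backend = db+postgresql://airflow:secret@postgres-host:5432/airflow
worker_concurrency = 16
```

Flower runs as a separate process against the same broker (in Airflow 2.x it is started with `airflow celery flower`).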
If a worker fails, we retain all information about the success or failure of its tasks and can reschedule them later.
If an edge fails, we can still complete the ongoing tasks.
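Because task state lives in the metadata database rather than on the workers, recovering from such failures is mostly a matter of clearing and rescheduling the affected task instances. A sketch with the hypothetical DAG and task names from above, assuming the Airflow 2.x CLI:

```bash
# Reschedule tasks that were lost when a worker went down,
# together with everything downstream of them:
airflow tasks clear bi_summary_tables \
    --task-regex build_summary_table \
    --start-date 2023-01-05 --end-date 2023-01-05 \
    --downstream --yes
```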
We needed to have the old and new installations running at the same time and achieve feature parity. This essentially meant having all production jobs running simultaneously on both Spoink and Airflow for multiple weeks.
You do not add new features to the old installation. We introduced a code freeze for the duration of the migration (2–3 weeks) to avoid adding more moving components to the migration process.
You do not migrate all teams at once, but gradually, with proper testing and validation.