An end-to-end project to collect, process, and visualize housing data.
Collect -> Process -> Visualize // 🏠📄 -> 🛠 -> 💻📈
Cost savings 💸.
AWS Step Functions is a low-code, visual workflow service that lets you build distributed applications, automate IT and business processes, and build data and machine learning pipelines using AWS services.
AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers.
"Dagster is a data orchestrator for machine learning, analytics, and ETL. It lets you define pipelines in terms of the data flow between reusable, logical components, then test locally and run anywhere." Great intro here.
pip install dagster
Dagster is lightweight, extensible, and can be run anywhere - locally, on Airflow, on Kubernetes, you choose.

# Main Step Function inputs
{
  "run_data_collect": true or false,
  "run_data_process": true or false
}
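For illustration, the two flags can be serialized into the execution input with a small helper. The function name below is hypothetical; the real pipeline may simply pass the JSON document directly when starting an execution:

```python
import json


def build_input(run_data_collect: bool, run_data_process: bool) -> str:
    """Serialize the two flags into the JSON input the main Step Function expects."""
    return json.dumps(
        {
            "run_data_collect": run_data_collect,
            "run_data_process": run_data_process,
        }
    )


# The resulting string can be passed as the `input` argument of boto3's
# stepfunctions start_execution call.
```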
Easier debugging.
Child state machines are invoked with `startExecution.sync`. This ensures that the parent Step Function waits until the child Step Function finishes its work. CodeBuild jobs are started the same way with `startBuild.sync`. Build parameters are passed via `--build-arg` in CodeBuild.

Stage | Description | Format
---|---|---
Raw | "Raw" data gathered in the "Data Gathering" step of the state machine is downloaded to this folder. | txt
Intermediate | Cleaned "raw" data. At this stage redundant columns are removed; data is cleaned, validated, and mapped. | csv
Primary | Aggregated data that will be consumed by the front-end. | csv
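As a sketch, a Task state in the parent machine that waits on a child state machine could look like this in Amazon States Language (the state name and ARN are placeholders, not the project's actual definition):

```json
{
  "Run data process": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution.sync",
    "Parameters": {
      "StateMachineArn": "arn:aws:states:REGION:ACCOUNT_ID:stateMachine:PLACEHOLDER"
    },
    "End": true
  }
}
```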
An `@op` is a unit of compute work to be done - it should be simple and written in a functional style. A larger number of `@op`s can be connected into a `@graph` for convenience - a logical grouping of ops based on job type. I connected the mapping, cleaning, and validation steps into graphs. A `@job` is a fully connected graph of `@op` and `@graph` units that can be triggered to process data.

You can also run `@op`s in the same job in parallel and reuse `@op`s by aliasing them. See the gist below:

`@job` and `@graph` implementation in Dagster.
`@graph` helps to group operations together and unclutters the UI compared to an op-only implementation. Furthermore, you can test a full block of operations instead of testing operation by operation.

The `dagster-aws` module exists - looking at it, it does exactly what I need, minus the code I had to write.
Saving outputs directly inside an `@op` is a legit approach, but `AssetMaterialization` seems like a better, more Dagster-y way to do it.

I used a `Settings` class which contained all settings and configs. In hindsight, I should have added the `Settings` class to Dagster's `context` or just used Dagster's config. (I think I carried over the mindset from the previous pure-Python implementation of the data processing pipeline.)