kedro is built around DataSets that the user specifies through Catalog entries. These Catalog entries are loaded, run through a function, and saved by Nodes. The order in which these Nodes are executed is determined by the Pipeline, which is a DAG. It's the runner's job to manage the execution of the Nodes.

This is an updated version of my original what-is-kedro article.
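To make this concrete, here is a minimal end-to-end sketch of those four pieces working together, assuming kedro 0.17.x; the dataset names and the function are made up for illustration.

import pandas as pd
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def drop_missing(cars: pd.DataFrame) -> pd.DataFrame:
    return cars.dropna()


# Catalog: names each DataSet that can be loaded or saved
catalog = DataCatalog({'cars': MemoryDataSet(pd.DataFrame({'mpg': [21.0, None]}))})

# Node: a function wired to input and output dataset names
clean = node(drop_missing, inputs='cars', outputs='clean_cars')

# Pipeline: a DAG of nodes; the runner manages their execution
SequentialRunner().run(Pipeline([clean]), catalog)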
kedro is unopinionated: it does not determine where or how your pipelines should be run. The kedro team supports several orchestrators with very little added on top of the base template.

DataSets do the io work, like storing pandas DataFrames to parquet, csv, or a sql table. If kedro does not come with support for the type of python objects you work with, don't worry: you can fork the closest option they support and build your own. Or, if you do not want to build your own, you can use a PickleDataSet for anything.
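Every DataSet exposes the same load/save interface. Here is a quick sketch, assuming kedro 0.17.x, where the bundled datasets live under kedro.extras.datasets; the filepaths are made up.

import pandas as pd
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.extras.datasets.pickle import PickleDataSet

cars = CSVDataSet(filepath='data/01_raw/cars.csv')
cars.save(pd.DataFrame({'mpg': [21.0, 33.9]}))  # writes the csv
df = cars.load()                                # reads it back

# PickleDataSet can store nearly any python object
model = PickleDataSet(filepath='data/06_models/model.pkl')
model.save({'any': 'python object'})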
The Catalog is a collection of yaml entries, each configuring a DataSet, that together build your data catalog. An entry specifies the type of DataSet and any arguments that DataSet needs. Much of the time this is simply a filepath.

test:
  type: pandas.CSVDataSet
  filepath: s3://your_bucket/test.csv
Here is the most basic yaml catalog entry, taken from the kedro docs.
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
Here is a slightly more complex example that takes load_args and save_args, taken from the kedro docs.
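Inside a project kedro loads these yaml files for you, but you can do the same thing by hand. A sketch, assuming the entries above are saved at conf/base/catalog.yml:

import yaml
from kedro.io import DataCatalog

with open('conf/base/catalog.yml') as f:
    catalog = DataCatalog.from_config(yaml.safe_load(f))

cars = catalog.load('cars')  # pandas.read_csv with the load_args above
catalog.save('cars', cars)   # DataFrame.to_csv with the save_args above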
Nodes wrap plain python functions, declaring which catalog entries they take as inputs and which they produce as outputs.

from typing import Dict, List

import numpy as np
import pandas as pd
from kedro.pipeline import node


def clean_data(cars: pd.DataFrame,
               boats: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    return dict(cars_df=cars.dropna(), boats_df=boats.dropna())


def halve_dataframe(data: pd.DataFrame) -> List[pd.DataFrame]:
    return np.array_split(data, 2)


nodes = [
    node(clean_data,
         inputs=['cars2017', 'boats2017'],
         outputs=dict(cars_df='clean_cars2017',
                      boats_df='clean_boats2017')),
    node(halve_dataframe,
         'clean_cars2017',
         ['train_cars2017', 'test_cars2017']),
    node(halve_dataframe,
         dict(data='clean_boats2017'),
         ['train_boats2017', 'test_boats2017'])
]
Here is an example of three nodes, taken from the kedro docs.
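The runner normally executes nodes for you, but a node can also be run directly, which is handy for debugging. A sketch continuing with the nodes list above, with toy DataFrames standing in for the real data:

import pandas as pd

cars_df = pd.DataFrame({'mpg': [21.0, None]})
boats_df = pd.DataFrame({'knots': [12.0, None]})

# run takes a dict mapping input dataset names to values
outputs = nodes[0].run({'cars2017': cars_df, 'boats2017': boats_df})
# -> {'clean_cars2017': ..., 'clean_boats2017': ...}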
The Pipeline is a DAG (Directed Acyclic Graph): a graph object that flows in one direction. You can slice into the pipeline using a few built-in graph methods: to_nodes, from_nodes, to_outputs, and from_inputs. You can chain these method calls, since each one returns a new Pipeline object. You can also ask a pipeline for its edges with inputs and outputs, and list every dataset along the way with all_inputs or all_outputs. Creating a Pipeline is as simple as passing it a list of nodes.
.from kedro.pipeline import Pipeline, node
# using our nodes from last time
Pipeline(nodes)
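Those slicing methods make it easy to work with part of the DAG. A sketch using the pipeline above:

pipeline = Pipeline(nodes)

pipeline.to_outputs('test_cars2017')     # only what is needed to build test_cars2017
pipeline.from_inputs('clean_boats2017')  # everything downstream of the cleaned boats

# calls chain, since each returns a new Pipeline
pipeline.from_inputs('cars2017').to_outputs('train_cars2017')

pipeline.inputs()       # free inputs: {'cars2017', 'boats2017'}
pipeline.all_outputs()  # every dataset produced along the way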
Hooks are implemented with the pluggy framework. Yes, the one that pytest is built on. There are a number of different lifecycle methods that let us hook in around where kedro is running, such as before_pipeline_run or after_pipeline_run.
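A sketch of a hook, assuming kedro 0.16.4+ where hook_impl lives in kedro.framework.hooks; how you register the class (settings.py or the ProjectContext) depends on your kedro version.

from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # called once, right before the runner starts executing nodes
        print(f"about to run {len(pipeline.nodes)} nodes")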