25
loading...
This website collects cookies to deliver better user experience
versioned: true
and we will be able to compare results over time. Yes kedro makes it that easy to setup.pip install pipx
pipx run kedro new
cd versioned-partitioned-kedro-example
conda create -n versioned-partitioned-kedro-example python=3.8 -y
conda activate versioned-partitioned-kedro-example
pip install kedro
kedro install
git init
git add .
git commit -m "init project from pipx run kedro new"
cookiecutter
⚠️ Please do not skip out on using a virtual environment. You may use whichever virtual environment tool you prefer, but please do not skip out. Wrecking a running project for learning is not fun.
kedro[pandas]
and find-kedro
. Since those are extra packages our example will require.aiohttp
black==21.5b1
find-kedro
flake8>=3.7.9, <4.0
ipython
isort~=5.0
jupyter_client>=5.1, <7.0
jupyterlab~=3.0
jupyter~=1.0
kedro-telemetry~=0.1.0
kedro==0.17.4
kedro[pandas]
nbstripout~=0.4
pytest-cov~=2.5
pytest-mock>=1.7.1, <2.0
pytest~=6.2
requests
wheel>=0.35, <0.37
find-kedro
, and I like using it to create my pipeline object. Think of how pytest automatically picks up everything named test
, find-kedro
does the same thing for kedro. It picks up everything with node
or pipeline
in the name and creates pipelines out of it.requirements.in
, we can tell kedro to install everything and compile the dependencies. Behind the scenes --build-reqs
uses a library called pip-compile
to create a requirements.txt
file with hard pinned dependencies, which is ideal for creating reproducible projects. You and your future colleagues may not thank you for this, but they sure as heck won't be cursing your name when they can'tkedro install --build-reqs
git add .
git commit -m "added additional dependencies"
cars.csv
from a URL to a parquet
file. I am going to use a lambda to build my identity function inline.# pipelines/cars_nodes.py
from kedro.pipeline import node
nodes = []
nodes.append(
node(
func=lambda x:x,
inputs='raw_cars',
outputs='int_cars',
name='create_int_cars',
)
)
🗒️ notefind-kedro' will automatically pick up these nodes for us after we set up our
pipeline_registry.py`.
bash
git add .
git commit -m "add create_int_cars node"
find-kedro
comes in. Once we point to the directory where our modules of nodes/pipelines are, it creates the pipelines dictionary for us automatically. It will even separate each module into a pipeline and stitch them all into one default pipeline.
` pythonReturns:
A mapping from a pipeline name to a "Pipeline "object.
"""
pipeline_dir = Path(__file__).parent / 'pipelines'
return find_kedro(directory= pipeline_dir)
🗒️ This is very similar to the default ` pipeline_registry'except the last two
lines.
git add .
git commit -m "implement find-kedro"
MemoryDataSet
's. Thus, using the cli helps consistently scaffold the catalog and ensure we don't end up with a typo in our dataset name.kedro catalog create --pipeline cars_nodes
base/catalog/cars_nodes.yml
raw_cars:
type: MemoryDataSet
int_cars:
type: MemoryDataSet
🔥 use the kedro cli to fill in any missing datasets from the automatically catalog.
MemoryDataSet
's for us. We will convert them to the appropriate dataset type and turn on versioning for our int
layer, which is the first point we save in our environment.raw_cars:
type: pandas.CSVDataSet
filepath: https://waylonwalker.com/cars.csv
int_cars:
type: pandas.ParquetDataSet
filepath: data/int_cars.parquet
versioned: true
git add .
git commit -m "create catalog"
int_cars.parquet
directory.kedro run
kedro run
kedro run
kedro run
kedro run
🗒️ we put our data in the data directory. By default, this directory is included in the .gitignore
and will not be picked up by git.
data/int_cars.parquet
shows that I now have five different datasets available. I can load old ones, but by default, kedro will load the latest one.ls data/int_cars.parquet
2021-07-05T15.24.53.164Z
2021-07-05T15.29.56.144Z
2021-07-05T15.30.23.101Z
2021-07-05T15.30.26.555Z
2021-07-05T15.31.12.688Z
🗒️ kedro sets the version at the timestamp that the session starts. All datasets created within the same run will have the same version.
type: PartitionedDataSet
, with a path
pointing to the same place as the original, and a dataset
type the same as the original.int_cars_partitioned:
type: PartitionedDataSet
dataset: pandas.ParquetDataSet
path: data/int_cars.parquet
PartitionedDataSet
.In [17]: context.catalog.list()
Out[17]:
['raw_cars',
'int_cars',
'int_cars_partitioned',
'parameters']
context.catalog.load('int_cars_incremental')
.In [19]: context.catalog.load('int_cars_incremental')
2021-07-05 11:32:40,534 - kedro.io.data_catalog - INFO - Loading data from `int_cars_incremental` (IncrementalDataSet)...
Out[19]:
{'2021-07-05T15.29.56.144Z/int_cars.parquet': Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2,
'2021-07-05T15.30.23.101Z/int_cars.parquet': Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2,
'2021-07-05T15.30.26.555Z/int_cars.parquet': Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2,
'2021-07-05T15.31.12.688Z/int_cars.parquet': Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2}
👆 notice that incremental datasets are all loaded for you, its a dict of filepath:dataset
PartitionedDataSet
. We can add it to the catalog in a very similar way to how we added the IncrementalDataSet
.int_cars_incremental:
type: IncrementalDataSet
dataset: pandas.ParquetDataSet
path: data/int_cars.parquet
PartitionedDataSet
's are likely a better option for this use.In [18]: context.catalog.load('int_cars_partitioned')
2021-07-05 11:31:11,253 - kedro.io.data_catalog - INFO - Loading data from `int_cars_partitioned` (PartitionedDataSet)...
Out[18]:
{'2021-07-05T15.29.56.144Z/int_cars.parquet': <bound method AbstractVersionedDataSet.load of <kedro.extras.datasets.pandas.parquet_dataset.ParquetDataSet object at 0x7f4bb1570820>>,
'2021-07-05T15.30.23.101Z/int_cars.parquet': <bound method AbstractVersionedDataSet.load of <kedro.extras.datasets.pandas.parquet_dataset.ParquetDataSet object at 0x7f4bb1570850>>,
'2021-07-05T15.30.26.555Z/int_cars.parquet': <bound method AbstractVersionedDataSet.load of <kedro.extras.datasets.pandas.parquet_dataset.ParquetDataSet object at 0x7f4bb1570910>>,
'2021-07-05T15.31.12.688Z/int_cars.parquet': <bound method AbstractVersionedDataSet.load of <kedro.extras.datasets.pandas.parquet_dataset.ParquetDataSet object at 0x7f4bb15709a0>>}
IncrementalDataSet
's and PartitionedDataSet
's are very similar as they give you access to a whole directory of data that uses the same underlying dataset loader. The significant difference is whether you want your data pre-loaded or if you want to load and dispose of it as you iterate over it.PartitionedDataSet
to collect stats on our dataset over time. This node does a dict comprehension to get the length of each version that we pulled.def timeseries_partitioned(cars: Dict):
return {k:len(car()) for k, car in cars.items()}
nodes.append(
node(
func=timeseries_partitioned,
inputs='int_cars_partitioned',
outputs='int_cars_timeseries_partitioned',
name='create_int_cars_timeseries_partitioned',
)
)
🗒️ note that inside of the dict comprehension car is a load function that we need to call.
IncrementalDataSet
looks very similar, except this time car is loaded data inside of the dict comprehension, not a function that we need to call.def timeseries_incremental(cars: Dict):
return {k:len(car) for k, car in cars.items()}
nodes.append(
node(
func=timeseries_incremental,
inputs='int_cars_incremental',
outputs='int_cars_timeseries_incremental',
name='create_int_cars_timeseries_incremental',
)
)
kedro catalog create --pipeline cars_nodes
int_cars_timeseries_partitioned:
type: MemoryDataSet
int_cars_timeseries_incremental:
type: MemoryDataSet
int_cars_timeseries_partitioned:
type: pickle.PickleDataSet
filepath: data/int_cars_timeseries_partitioned.parquet
int_cars_timeseries_incremental:
type: pickle.PickleDataSet
filepath: data/int_cars_timeseries_incremental.parquet
In [32]: context.catalog.load('int_cars_timeseries_incremental')
2021-07-05 12:00:55,014 - kedro.io.data_catalog - INFO - Loading data from `int_cars_timeseries_incremental` (PickleDataSet)...
Out[32]:
{'2021-07-05T15.29.56.144Z/int_cars.parquet': 32,
'2021-07-05T15.30.23.101Z/int_cars.parquet': 32,
'2021-07-05T15.30.26.555Z/int_cars.parquet': 32,
'2021-07-05T15.31.12.688Z/int_cars.parquet': 32,
'2021-07-05T16.43.43.088Z/int_cars.parquet': 32}
In [33]: context.catalog.load('int_cars_timeseries_partitioned')
2021-07-05 12:01:03,223 - kedro.io.data_catalog - INFO - Loading data from `int_cars_timeseries_partitioned` (PickleDataSet)...
Out[33]:
{'2021-07-05T15.29.56.144Z/int_cars.parquet': 32,
'2021-07-05T15.30.23.101Z/int_cars.parquet': 32,
'2021-07-05T15.30.26.555Z/int_cars.parquet': 32,
'2021-07-05T15.31.12.688Z/int_cars.parquet': 32,
'2021-07-05T16.43.43.088Z/int_cars.parquet': 32,
'2021-07-05T16.50.46.686Z/int_cars.parquet': 32}
☝️ I have a full article on creating datasets that are not tabular datasets using pickle.