$ pip install pipx
$ pipx install dvc
$ dvc --version
2.3.0
git checkout
to the tag to follow along.
$ git clone https://github.com/Lee-W/dvc_example/ --branch v1-base
$ cd dvc_example
$ tree
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
pipenv install
, you can run export SYSTEM_VERSION_COMPAT=1
before it. This is an open pipenv issue as of now (Issue with NumPy, macOS 11 Big Sur, Python 3.9.1 Does pipenv not use the latest pip? #4564). Or, you can just run the following commands.
# install needed tools
pipx install pipenv invoke
# set up environments
invoke init-dev
def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = process_data(X, y)
    model = train_model(X_train, y_train)
    predicted_y = model.predict(X_test)
    output_results(y_test, predicted_y)
    output_metrics(y_test, predicted_y)
pipenv install dvc
pipenv install dvc[s3]
)
[s3]
[azure]
[gdrive]
[gs]
[oss]
[ssh]
pipenv install dvc[all]
to install them all
# initialize DVC configurations
$ pipenv run dvc init
# see what's created by DVC
$ tree .dvc
.dvc
├── config
└── plots
    ├── confusion.json
    ├── confusion_normalized.json
    ├── default.json
    ├── linear.json
    ├── scatter.json
    └── smooth.json
# track DVC configuration through git
$ git add .dvc
# git commit
$ pipenv run cz commit
../dvc_remote
as our remote storage. You can change it to s3 or other remote storage.
mkdir ../dvc_remote
dvc remote add --default local ../dvc_remote
--default
flag, we can push/pull from local
remote without specifying remote name.
.dvc/config
.
$ cat .dvc/config
[core]
    remote = local
['remote "local"']
    url = ../../dvc_remote
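Note that the stored url differs from the path passed to dvc remote add. A minimal sketch of the path arithmetic behind that, using only the standard library; the helper name `resolve_remote_url` and the `/work/dvc_example` project location are hypothetical and exist only for this illustration:

```python
import posixpath


def resolve_remote_url(base_dir: str, url: str) -> str:
    """Resolve a relative remote url against the directory it is relative to."""
    return posixpath.normpath(posixpath.join(base_dir, url))


# Hypothetical project location, used only for this illustration.
project_root = "/work/dvc_example"
config_dir = posixpath.join(project_root, ".dvc")

# Path typed on the command line, relative to the project root.
from_cli = resolve_remote_url(project_root, "../dvc_remote")
# Path stored in .dvc/config, relative to the .dvc directory.
from_config = resolve_remote_url(config_dir, "../../dvc_remote")

print(from_cli)
print(from_config)
```

Both values resolve to the same directory, which is why the extra `..` appears in the config file.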
../../dvc_remote
instead of ../dvc_remote
because it's the relative path to .dvc
. As we've not yet pushed anything to our pseudo remote, ../dvc_remote
is still empty.
data/
.
def load_data():
    # Load data
    digits = datasets.load_digits()
    ...
data/
. Note that it's a one-time use script. We won't add it into git.
import os
import pandas as pd
from sklearn import datasets
os.mkdir("data")
digits = datasets.load_digits()
df = pd.DataFrame(digits.data)
df.to_csv("data/digit_data.csv", header=False, index=False)
df = pd.DataFrame(digits.target)
df.to_csv("data/digit_target.csv", header=False, index=False)
load_data
and main
functions to read data from these files.
import csv


def load_data(X_path, y_path):
    # QUOTE_NONNUMERIC converts every unquoted field to float while reading
    with open(X_path) as input_file:
        csv_reader = csv.reader(input_file, quoting=csv.QUOTE_NONNUMERIC)
        X = list(csv_reader)
    with open(y_path) as input_file:
        csv_reader = csv.reader(input_file, quoting=csv.QUOTE_NONNUMERIC)
        # each target row holds a single value
        y = [row[0] for row in csv_reader]
    return X, y
......
def main():
    X, y = load_data("data/digit_data.csv", "data/digit_target.csv")
    ......
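The reading code above relies on `csv.QUOTE_NONNUMERIC` to get numbers instead of strings. A small sketch of that behaviour on in-memory data; the sample rows are made up for illustration and are not taken from the real dataset:

```python
import csv
import io

# Hypothetical sample standing in for data/digit_data.csv.
features_file = io.StringIO("0.0,5.0,13.0\n0.0,0.0,12.0\n")
# Hypothetical sample standing in for data/digit_target.csv.
targets_file = io.StringIO("0\n1\n")

# With QUOTE_NONNUMERIC the reader converts every unquoted field to float.
X = list(csv.reader(features_file, quoting=csv.QUOTE_NONNUMERIC))
y = [row[0] for row in csv.reader(targets_file, quoting=csv.QUOTE_NONNUMERIC)]

print(X)
print(y)
```

Note that the targets also come back as floats, even though they were written as integers.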
pipenv run python digit_recognizer/digit_recognizer.py
to check whether everything works as expected. If so, add these code changes to git.
data/
to DVC.
$ pipenv run dvc add data
100% Add|████████████████|1/1 [00:00, 2.14file/s]
To track the changes with git, run:
git add data.dvc .gitignore
dvc add
creates a data.dvc
file to track data/
and adds that directory to .gitignore
so that data/
will be tracked only through DVC, not git.
# Add the DVC files to git
git add .gitignore data.dvc
# git commit
pipenv run cz commit
data.dvc
, we can see 2 files (digit_data.csv
and digit_target.csv
) are tracked.
$ cat data.dvc
outs:
- md5: b8d81f4964ecb86739c79c833fb491f3.dir
  size: 494728
  nfiles: 2
  path: data
dvc push
../dvc_remote
$ tree ../dvc_remote
../dvc_remote
├── 02
│   └── b861b6dc8e08da6d66547860f69277
├── 8c
│   └── ba569595920d230ade453b150f372b
└── b8
    └── d81f4964ecb86739c79c833fb491f3.dir
3 directories, 3 files
b8d81f4964ecb86739c79c833fb491f3.dir
. There's also a corresponding file in ../dvc_remote/b8/d81f4964ecb86739c79c833fb491f3.dir
.
$ cat ../dvc_remote/b8/d81f4964ecb86739c79c833fb491f3.dir
[{"md5": "02b861b6dc8e08da6d66547860f69277", "relpath": "digit_data.csv"}, {"md5": "8cba569595920d230ade453b150f372b", "relpath": "digit_target.csv"}]
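The listing and the tree output together suggest how a hash maps to a location in the remote: the first two characters become a directory and the rest becomes the file name. A minimal sketch of that mapping, inferred only from the outputs shown here; `object_path` is a hypothetical helper, not a DVC API:

```python
import json

# Listing copied from the .dir file shown above.
dir_listing = json.loads(
    '[{"md5": "02b861b6dc8e08da6d66547860f69277", "relpath": "digit_data.csv"}, '
    '{"md5": "8cba569595920d230ade453b150f372b", "relpath": "digit_target.csv"}]'
)


def object_path(md5: str) -> str:
    """Split a hash into '<first two characters>/<remaining characters>'."""
    return f"{md5[:2]}/{md5[2:]}"


# Where each tracked file is expected to live inside the remote.
locations = {entry["relpath"]: object_path(entry["md5"]) for entry in dir_listing}

for relpath, path in locations.items():
    print(f"{relpath} -> ../dvc_remote/{path}")
```

The same split applied to the hash in data.dvc gives `b8/d81f4964ecb86739c79c833fb491f3.dir`, which matches the third file in the tree.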
../dvc_remote
.
*.dvc
in our project
# temporarily delete our data locally
$ rm -rf data
# check whether DVC actually tracks our data
$ dvc status
data.dvc:
changed outs:
deleted: data
# bring our data back from remote storage
$ dvc checkout data
data
├── digit_data.csv
└── digit_target.csv
data/digit_data.csv
and data/digit_target.csv
.
# check what's changed
$ dvc status
data.dvc:
changed outs:
modified: data
# Add these changes to DVC and git
$ dvc add data
$ git add data.dvc
# git commit
$ pipenv run cz commit
# Push these changes to our remote storage
$ dvc push
$ cat data.dvc
outs:
- md5: a333e114a49194e823ab9a4fa9e33ee9.dir
  size: 494172
  nfiles: 2
  path: data
../dvc_remote
due to the data changes. You can follow the steps in the previous section to see what is actually stored.
$ tree ../dvc_remote
../dvc_remote
├── 02
│   └── b861b6dc8e08da6d66547860f69277
├── 2a
│   └── 6cfa13365ac9b3af5146133aca6789
├── 8c
│   └── ba569595920d230ade453b150f372b
├── 94
│   └── 2481fce846fb9750b7b8023c80a5ef
├── a3
│   └── 33e114a49194e823ab9a4fa9e33ee9.dir
└── b8
    └── d81f4964ecb86739c79c833fb491f3.dir
6 directories, 6 files
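Comparing this tree with the earlier one shows that the push only added objects and kept the old ones, so both versions of the data remain retrievable. A short sketch of that comparison using plain sets; the two listings are copied from the tree outputs in this post:

```python
# Objects in ../dvc_remote after the first push.
before = {
    "02/b861b6dc8e08da6d66547860f69277",
    "8c/ba569595920d230ade453b150f372b",
    "b8/d81f4964ecb86739c79c833fb491f3.dir",
}

# Objects in ../dvc_remote after pushing the modified data.
after = {
    "02/b861b6dc8e08da6d66547860f69277",
    "2a/6cfa13365ac9b3af5146133aca6789",
    "8c/ba569595920d230ade453b150f372b",
    "94/2481fce846fb9750b7b8023c80a5ef",
    "a3/33e114a49194e823ab9a4fa9e33ee9.dir",
    "b8/d81f4964ecb86739c79c833fb491f3.dir",
}

added = sorted(after - before)
removed = sorted(before - after)

print("added:", added)
print("removed:", removed)
```

Three objects are added (two data files and one new directory listing) and none are removed.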
git checkout
to the previous git commit to see what happens if we only revert the changes in data.dvc
.
# or "git checkout v2-track-data"
git checkout HEAD~1
wc -l data/digit_data.csv
, we'll still find 1795 rows instead of the 1797 rows from the previous stage. That's because we need to run dvc checkout
as well.
dvc checkout
right after git checkout
. You can install these git-hooks through dvc install
. These hooks are added into .git/hooks
. If you want to know the details of what's added, read dvc install.
git checkout
. This is the output message of dvc checkout
.
M data/
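Without the hooks, the same two-step sequence can be scripted by hand. A minimal sketch, assuming `git` and `dvc` are both on the PATH; `checkout_commands` and `checkout_revision` are hypothetical helper names, not part of DVC:

```python
import subprocess
from typing import List


def checkout_commands(revision: str) -> List[List[str]]:
    """Commands that switch the code and then sync the DVC-tracked data."""
    return [
        ["git", "checkout", revision],  # restores data.dvc from that commit
        ["dvc", "checkout"],  # restores data/ to match data.dvc
    ]


def checkout_revision(revision: str) -> None:
    """Run both commands in order, stopping at the first failure."""
    for command in checkout_commands(revision):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the plan; call checkout_revision() inside a real repo to run it.
    for command in checkout_commands("HEAD~1"):
        print(" ".join(command))
```

Keeping the command list separate from the code that runs it makes the order easy to inspect before anything touches the working tree.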
git remote add origin <REMOTE GIT REPO>
git push origin main
# check what's in our repo
$ dvc list <REMOTE GIT REPO>
.dvcignore
.github
.gitignore
LICENSE
Pipfile
Pipfile.lock
data
data.dvc
digit_recognizer
docs
mkdocs.yml
output
tasks.py
data/
, we can still list it through DVC.
../dvc_remote
as DVC remote storage, we need to create the new project at the same directory level as dvc_example
. We'll clone the project into ../dvc_example_on_another_machine
.
# Clone the git repo
$ git clone <YOUR REMOTE GIT REPO> ../dvc_example_on_another_machine
$ cd ../dvc_example_on_another_machine
$ tree .
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── data.dvc
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
3 directories, 9 files
data/
has not yet been added to the project. We can now pull data from our DVC remote storage.
# pull data from default DVC remote storage
$ dvc pull
A data/
1 file added and 2 files fetched
# `data` has now been added to the project
$ tree .
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── data
│ ├── digit_data.csv
│ └── digit_target.csv
├── data.dvc
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
4 directories, 11 files
dvc_example_on_another_machine
for the following steps. Feel free to remove it and change directory back to dvc_example
.