Create a Python 3.8 virtualenv and activate it:

```
virtualenv --python /usr/bin/python3.8 .venv
source .venv/bin/activate
```
The dependencies go in a `requirements.txt` file:

```
# --- requirements.txt
fastapi~=0.61.1
```

You can install the modules with pip. There are plenty of guides on how to get pip on your system if you don't have it:

```
pip install -r requirements.txt
```
Save the application code in a `main.py` file. If you prefer, you can clone the FastAPI template published at https://github.com/cosimo/fastapi-ml-api:

```python
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

# get_model() stands in for whatever function loads your ML model;
# a concrete version (load_language_model) is shown later in this post
model = get_model()


@app.post("/cluster")
def cluster():
    return {"Hello": "World"}
```
For local development you can run the application with uvicorn:

```
uvicorn main:app --reload
```
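As a quick sanity check of the endpoint, a minimal smoke test could look like the sketch below. It uses FastAPI's TestClient and assumes that `get_model()` is actually defined in `main.py` and that the `requests` package (needed by TestClient) is installed; the file name is just a suggestion.

```python
# test_main.py -- minimal smoke test for the /cluster endpoint (illustrative)
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)


def test_cluster():
    response = client.post("/cluster")
    assert response.status_code == 200
    assert response.json() == {"Hello": "World"}
```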
In production the application runs under gunicorn with uvicorn workers:

```
gunicorn -c gunicorn_conf.py -k uvicorn.workers.UvicornWorker --preload main:app
```

Breaking the command down:

- `-k` tells gunicorn to use a specific worker class
- `main:app` instructs gunicorn to load the `main` module and use `app` (in this case the FastAPI instance) as the application code that all workers should be running
- `--preload` causes gunicorn to change the worker startup procedure

Normally gunicorn forks the worker processes first, and each worker then loads the application on its own. The `--preload` option inverts the sequence of operations by loading the application instance first and then forking all worker processes. Because of how `fork()` works, each worker process will be a copy of the main gunicorn process and will share (part of) the same memory space. This is what allows the `model` variable to be “shared” across all processes!
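A simple way to convince yourself that the preloaded model is inherited by every worker is to expose the worker PID and the `id()` of the model object: with `--preload`, each worker reports the same object id, because the object was created once in the parent before the fork. This is only a sketch; the `/debug/worker` endpoint name is made up, and `get_model()` is the same placeholder used in `main.py` above.

```python
# main.py (debugging sketch): report which worker answered a request and the
# identity of the model object that worker sees.
import os

from fastapi import FastAPI

app = FastAPI()
model = get_model()  # with --preload this runs once, in the gunicorn parent


@app.get("/debug/worker")
def worker_info():
    # The same "model_id" across different "pid" values means every worker
    # inherited the single model object created before gunicorn forked.
    return {"pid": os.getpid(), "model_id": id(model)}
```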
A few things I tried first did not work:

- Using `pytorch.multiprocessing` in the gunicorn configuration module
- Modifying gunicorn itself (!) to use `pytorch.multiprocessing` to load the model. I did it just as a prototype, but even then… bad idea
- Using `/dev/shm` (Linux shared memory tmpfs) as a filesystem where to store the PyTorch model file
What did work in the end:

- gunicorn must create the FastAPI application to start it, so I load the model (as a global) when creating the FastAPI application, verifying that it is loaded before the workers are forked, and only loaded once
- Added the `preload_app = True` option to gunicorn's configuration module
- Added `max_requests = 50`. I limited the number of requests each worker handles because I noticed a sudden increase in memory usage in each worker regularly some minutes after startup. I couldn't trace it back to something specific, so I used this dirty workaround
- I did not use `async` on my FastAPI application methods. Other people have reported this solution not working for them… This remains to be understood
- When loading the PyTorch model, I call the `.eval()` and `.share_memory()` methods on it before returning it to the FastAPI application. This happens just on the first load:

```python
from sentence_transformers import SentenceTransformer


def load_language_model() -> SentenceTransformer:
    language_model = SentenceTransformer(SOME_MODEL_NAME)
    language_model.eval()
    language_model.share_memory()
    return language_model
```
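For completeness, here is how I'd sketch wiring that function into the application module as a module-level global that is loaded only once per process (with `--preload`, once in the gunicorn parent). The guard function and variable names are illustrative, not taken from the original code, and `load_language_model()` is assumed to be the function shown above, defined in the same module.

```python
# main.py (sketch): keep the model as a module-level global so the gunicorn
# parent loads it exactly once before forking the workers.
from typing import Optional

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()
_language_model: Optional[SentenceTransformer] = None


def get_language_model() -> SentenceTransformer:
    """Return the shared model, loading it on first use only."""
    global _language_model
    if _language_model is None:
        _language_model = load_language_model()  # the function shown above
    return _language_model


model = get_language_model()
```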
The two important ingredients are the `preload_app = True` option, plus the `.eval()` and `.share_memory()` calls if your model is PyTorch-based. Here is the relevant part of gunicorn's configuration module:

```python
# Preload the FastAPI application, so we can load the PyTorch model
# in the parent gunicorn process and share its memory with all the workers
preload_app = True

# Limit the amount of requests a single worker will handle, so as to
# curtail the increase in memory usage of each worker process
max_requests = 50
```
I also packaged the application as a Docker image, split in two stages so the large PyTorch and NLP model downloads don't have to be repeated on every build. It's easily applicable as a development option but also good for production, in case you deploy to a platform like Kubernetes as I did.

```dockerfile
# --- Dockerfile.stage1
# https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8

# Install PyTorch CPU version
# https://pytorch.org/get-started/locally/#linux-pip
RUN pip3 install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

# Here I'm using sentence_transformers, but you can use any library you need
# and make it download the model you plan using, or just copy/download it
# as appropriate. The resulting docker image should have the model bundled.
RUN pip3 install sentence_transformers==0.3.8
RUN python -c 'from sentence_transformers import SentenceTransformer; model = SentenceTransformer("")'
```
Build this first image and push it to your registry with the `stage1` tag (the Makefile below has a `docker-stage1` target for that). The main Dockerfile then builds on top of it:

```dockerfile
# --- Dockerfile
# $(REGISTRY) and $(PROJECT) match the Makefile variables shown below
FROM $(REGISTRY)/$(PROJECT):stage1

# Gunicorn config uses these env variables by default
ENV LOG_LEVEL=info
ENV MAX_WORKERS=3
ENV PORT=8000

# Give the workers enough time to load the language model (30s is not enough)
ENV TIMEOUT=60

# Install all the other required python dependencies
COPY ./requirements.txt /app
RUN pip3 install -r /app/requirements.txt

COPY ./config/gunicorn_conf.py /gunicorn_conf.py
COPY ./src /app
# COPY ./tests /tests
```
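The image expects a `gunicorn_conf.py` at the path copied above. A minimal sketch of such a configuration module, combining the two settings discussed earlier with the environment variables set in the Dockerfile, could look like the following; the exact variable names the tiangolo base image reads may differ, so treat this as illustrative rather than a drop-in file.

```python
# --- config/gunicorn_conf.py (sketch)
import os

loglevel = os.getenv("LOG_LEVEL", "info")
workers = int(os.getenv("MAX_WORKERS", "3"))
bind = f"0.0.0.0:{os.getenv('PORT', '8000')}"
timeout = int(os.getenv("TIMEOUT", "60"))

# Load the FastAPI app (and the PyTorch model) in the gunicorn parent process
# and share its memory with all the workers
preload_app = True

# Recycle workers regularly to curb the slow growth in memory usage
max_requests = 50
```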
One more detail: since the model tensors live in shared memory, the container needs a `/dev/shm` large enough to hold them (Docker's default is only 64 MB). When running the container with Docker, pass `--shm-size=1.75G` for example, or any suitable amount of memory for your own model, as in:

```
docker run --shm-size=1.75G --rm <command>
```
On Kubernetes you can achieve the same by mounting a memory-backed `emptyDir` volume at `/dev/shm`:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      volumes:
        - name: modelsharedmem
          emptyDir:
            sizeLimit: "1750Mi"
            medium: "Memory"
      containers:
        - name: {{ .Chart.Name }}
          ...
          volumeMounts:
            - name: modelsharedmem
              mountPath: /dev/shm
          ...
```
Finally, I like to add a Makefile to my projects, to create a memory of the commands needed to start a server, run tests or build containers. I don't need to use brain power to memorize any of that, and it's easy for colleagues to understand what commands are used for which purpose.

```makefile
# --- Makefile
PROJECT=myproject
BRANCH=main
REGISTRY=your.docker.registry/project

.PHONY: docker docker-push start test

start:
	./scripts/start.sh

# Stage 1 image is used to avoid downloading 2 Gb of PyTorch + nlp models
# every time we build our container
docker-stage1:
	docker build -t $(REGISTRY)/$(PROJECT):stage1 -f Dockerfile.stage1 .
	docker push $(REGISTRY)/$(PROJECT):stage1

docker:
	docker build -t $(REGISTRY)/$(PROJECT):$(BRANCH) .

docker-push:
	docker push $(REGISTRY)/$(PROJECT):$(BRANCH)

test:
	JSON_LOGS=False ./scripts/test.sh
```