Deploying Your AI Models with LitServe: A Step-by-Step Tutorial
Have you ever trained a fantastic AI model, only to hit a roadblock when it comes to deploying it for real-world use? Moving a model from a Jupyter Notebook to a scalable, production-ready API can be a daunting task. That’s where LitServe comes in!
LitServe, an open-source serving engine built on FastAPI, simplifies deploying your AI models. It’s optimized for AI workloads, offering features like batching, streaming, and GPU autoscaling so your model performs efficiently under load.
In this tutorial, we’ll walk you through the process of taking your AI model and deploying it using LitServe, making it accessible via a high-performance API.
Prerequisites
Before we begin, make sure you have:
- Python 3.8+ installed
- Familiarity with Python and basic AI concepts
Step 1: Install LitServe
First things first, let’s get LitServe installed. Open your terminal or command prompt and run:
pip install litserve
This will install LitServe and its necessary dependencies, including FastAPI.
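If you want to double-check the installation, you can print the package metadata pip recorded (name, installed version, dependencies):
pip show litserve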
Step 2: Define Your Model API with LitAPI
The heart of your LitServe deployment is the LitAPI class. This class acts as the bridge between your incoming API requests and your AI model. You’ll define how requests are handled, how your model processes data, and how responses are formatted.
Let’s break down the key methods you’ll implement within your LitAPI class:
- setup(self, device): This method is called once when your server starts. It’s the perfect place to load your pre-trained model, set up any required resources (like tokenizers), and move your model to the specified device (CPU or GPU). LitServe automatically provides the device argument based on your server configuration.
- decode_request(self, request): For each incoming request, this method transforms the raw request payload (e.g., JSON, image bytes) into a format your model expects as input.
- predict(self, x): This is where your AI model does its magic! It takes the decoded input x (or a batch of inputs) and runs your model to generate predictions.
- encode_response(self, output): Finally, this method takes the output from your predict method and formats it into the desired response to send back to the client (e.g., a JSON dictionary, a list).
Let’s create a file named server.py and put the following code inside it. For this tutorial, we’ll use a simple PyTorch model, but you can adapt it to any framework (TensorFlow, JAX, scikit-learn, etc.).
import litserve as ls
import torch
import torch.nn as nn
import os
# 1. Define your AI Model
# Replace this with your actual, pre-trained model.
# For demonstration, we'll create a simple linear model.
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)  # Expects 10 input features, outputs 1 value

    def forward(self, x):
        return self.linear(x)
# 2. Define your LitAPI class
class MyLitAPI(ls.LitAPI):
    def setup(self, device):
        # Load your model here.
        # If you saved your model (e.g., with torch.save(model.state_dict(), 'model.pth')), load it like this:
        # model_path = "path/to/your/model.pth"
        # self.model = SimpleModel().to(device)
        # self.model.load_state_dict(torch.load(model_path, map_location=device))

        # For this example, we'll just instantiate a dummy model.
        self.model = SimpleModel().to(device)
        self.model.eval()  # Set model to evaluation mode
        print(f"✅ Model loaded successfully on device: {device}")
    def decode_request(self, request):
        # We expect a JSON request like: {"input": [1.0, 2.0, ..., 10.0]}
        # LitServe passes the parsed JSON body to decode_request as a dict.
        input_data = request.get("input")
        if not isinstance(input_data, list) or len(input_data) != 10:
            raise ValueError("Invalid input format. Expected a list of 10 numbers.")

        # Convert to a PyTorch tensor
        return torch.tensor(input_data, dtype=torch.float32).unsqueeze(0)  # unsqueeze for batch dim
    def predict(self, x):
        with torch.no_grad():  # Disable gradient calculation for inference
            output = self.model(x)
        return output.squeeze(0).tolist()  # Convert tensor output to a list

    def encode_response(self, output):
        # Format the model's output as a JSON response
        return {"prediction": output}
# 3. Run the LitServe server
if __name__ == "__main__":
    api = MyLitAPI()

    # Configure the LitServer
    # accelerator="auto" will automatically use GPU if available, otherwise CPU.
    # You can also specify "cpu", "cuda", or "gpu".
    # max_batch_size and batch_timeout are crucial for performance!
    server = ls.LitServer(
        api,
        accelerator="auto",
        max_batch_size=16,    # Process up to 16 requests at once if they arrive quickly
        batch_timeout=0.1,    # Wait up to 0.1 seconds for a full batch
        workers_per_device=1  # Number of API workers per CPU/GPU
    )

    # Start the server on port 8000
    print("\n🚀 LitServe is running on http://127.0.0.1:8000")
    server.run(port=8000)
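For reference, the commented-out loading path in setup() assumes you saved a checkpoint beforehand. Here is a minimal sketch of producing such a file with standard PyTorch (the filename model.pth is just an example):
import torch

# SimpleModel can be imported from server.py; the __main__ guard
# prevents the server from starting on import.
from server import SimpleModel

model = SimpleModel()
# ... train the model here ...

# Save only the weights (state_dict); this is what load_state_dict() in setup() expects.
torch.save(model.state_dict(), "model.pth")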
A quick note on batching: notice the max_batch_size and batch_timeout arguments in ls.LitServer. These are powerful features that allow LitServe to group multiple incoming requests and process them as a single batch through your model. This significantly boosts GPU utilization and overall throughput, especially for models that benefit from parallel processing.
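If you want explicit control over how requests are collated, LitServe’s batching documentation describes optional batch and unbatch hooks on LitAPI. Here is a minimal sketch adapted to the SimpleModel above; treat the exact hook behavior as something to verify against your installed LitServe version:
import torch
import litserve as ls

class MyBatchedLitAPI(ls.LitAPI):
    def setup(self, device):
        self.model = SimpleModel().to(device)  # SimpleModel defined in server.py above
        self.model.eval()

    def decode_request(self, request):
        # Same contract as before: {"input": [ten floats]}
        return torch.tensor(request["input"], dtype=torch.float32).unsqueeze(0)

    def batch(self, inputs):
        # Collate the per-request tensors (each [1, 10]) into one [N, 10] batch
        return torch.cat(inputs, dim=0)

    def predict(self, x):
        with torch.no_grad():
            return self.model(x)  # Single forward pass over the whole batch

    def unbatch(self, output):
        # Split the [N, 1] output back into one result per request
        return [row.tolist() for row in output]

    def encode_response(self, output):
        return {"prediction": output}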
Step 3: Run Your LitServe Application
Now that you’ve defined your LitAPI, it’s time to bring your server online! You have a few deployment options:
Option A: Local Self-Hosting (for Development)
For quick testing and development, you can run your server directly from your Python script:
python server.py
You should see output indicating that LitServe is starting up, typically on http://127.0.0.1:8000.
Option B: Deploy with Lightning AI Cloud (Recommended for Production)
LitServe is developed by Lightning AI, and they provide a fantastic managed platform for seamless deployment, autoscaling, and production-grade features. This is often the most straightforward and robust way to deploy your models at scale.
- Install the Lightning CLI:
pip install lightning
- Log in to Lightning AI:
lightning login
This command will open your web browser to authenticate your account.
- Deploy your server: Navigate to the directory containing your server.py file and run:
lightning deploy server.py --cloud
The Lightning CLI will automatically package your application (including any requirements.txt file in your directory), build a Docker image, upload it, and deploy it to a scalable endpoint. You’ll receive a public URL for your deployed API!
Option C: Self-Hosting with Docker (Advanced)
For complete control over your deployment environment on your own infrastructure (local, cloud VM, etc.), you can containerize your LitServe application using Docker.
Create a requirements.txt file:
litserve
torch
# Add any other dependencies your model needs (e.g., transformers, numpy, etc.)
Create a Dockerfile in the same directory as server.py and requirements.txt:
# Use a suitable base image for your model (e.g., with PyTorch and CUDA)
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Copy your application files
COPY server.py .
COPY requirements.txt .
# If you have a saved model file, copy it too:
# COPY path/to/your/model.pth .

# Install dependencies
RUN pip install -r requirements.txt

# Expose the port LitServe will run on
EXPOSE 8000

# Command to run your LitServe server
CMD ["python", "server.py"]
Build your Docker image:
docker build -t my-litserve-model .
Run your Docker container:
docker run -p 8000:8000 my-litserve-model
This command maps port 8000 on your host machine to port 8000 inside the Docker container, making your API accessible.
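If the host machine has an NVIDIA GPU and the NVIDIA Container Toolkit set up, you can also pass the GPU through to the container with Docker’s --gpus flag so accelerator="auto" can pick it up:
docker run --gpus all -p 8000:8000 my-litserve-model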
Step 4: Test Your Deployment
Once your LitServe application is running, let’s test it out!
Using Python (Recommended)
Create a new Python file (e.g., client.py) and add the following:
import requests
import json
# Replace with your deployed URL if you're using Lightning AI Cloud
url = "http://127.0.0.1:8000/predict"

# Example input for our SimpleModel (10 numbers)
input_data = {"input": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]}

try:
    response = requests.post(url, json=input_data)

    # Raise an HTTPError for bad responses (4xx or 5xx)
    response.raise_for_status()

    print("Success!")
    print(json.dumps(response.json(), indent=2))
except requests.exceptions.HTTPError as err:
    print(f"HTTP Error: {err}")
    print(f"Response Content: {err.response.text}")
except requests.exceptions.ConnectionError as err:
    print(f"Connection Error: {err}. Make sure your server is running!")
except Exception as err:
    print(f"An unexpected error occurred: {err}")
Run this client script:
python client.py
You should see output similar to:
Success!
{
"prediction": [
-0.1009121686220169
]
}
(The exact prediction value will vary based on the random initialization of the SimpleModel.)
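To see the batching settings from Step 2 earn their keep, you can fire many requests at once. Here is a quick sketch using only requests and the standard library (timings and grouping will vary with your hardware and LitServe configuration):
import concurrent.futures
import requests

url = "http://127.0.0.1:8000/predict"
payload = {"input": [float(i) for i in range(1, 11)]}

def call_server(_):
    return requests.post(url, json=payload).json()

# 32 concurrent requests; with max_batch_size=16 the server can group them
# into a few batched forward passes instead of 32 separate ones.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(call_server, range(32)))

print(f"Received {len(results)} predictions")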
Using curl (for quick tests)
If you prefer using curl from your terminal:
curl -X POST -H "Content-Type: application/json" -d '{"input": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]}' http://127.0.0.1:8000/predict
Next Steps and Advanced Considerations
You’ve successfully deployed your first AI model with LitServe! Here are some key considerations for moving towards robust production deployments:
- Error Handling: Implement more specific error handling within your decode_request and predict methods to provide meaningful messages to your API users.
- Authentication: For production APIs, you’ll need security. LitServe supports API key authentication out of the box (see the sketch after this list).
- Model Versioning: As your models evolve, plan for how you’ll manage different versions of your API.
- Logging and Monitoring: Set up comprehensive logging to track requests, responses, and potential issues, and integrate with monitoring tools.
- Resource Tuning: Experiment with max_batch_size, batch_timeout, workers_per_device, and the accelerator setting in LitServer to find the optimal configuration for your specific model and hardware.
- Complex Pipelines: LitServe is flexible enough to handle more complex scenarios where multiple models might be chained together in an inference pipeline.
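As a starting point for the authentication item above: LitServe’s documentation describes API-key authentication configured through an environment variable on the server, with clients sending the key in a request header. Here is a sketch of what the client side might look like, assuming the LIT_SERVER_API_KEY variable and X-API-Key header described there; double-check the exact names against the docs for your version:
# On the server host, before starting server.py (shell):
#   export LIT_SERVER_API_KEY="my-secret-key"

import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": [1.0] * 10},
    headers={"X-API-Key": "my-secret-key"},  # assumed header name; verify in the LitServe docs
)
print(response.status_code, response.json())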
LitServe provides a powerful yet user-friendly way to serve your AI models. By leveraging its optimizations and the streamlined deployment options, you can get your AI solutions into the hands of users faster and more efficiently. Happy deploying!