How do you run an LLM locally with Docker and talk to it?

Using Docker to run a Large Language Model (LLM) locally is a great way to simplify the deployment process, ensure consistent environments, and manage dependencies. Docker allows you to containerize the LLM, making it easier to run on different machines without worrying about system-specific configurations.

Below is a step-by-step guide on how to use Docker to run an LLM locally and interact with it:


1. Prerequisites

Before you begin, ensure that you have the following installed on your machine:

  • Docker: Install Docker from Docker's official website.

    • For Windows: Use Docker Desktop.
    • For macOS: Use Docker Desktop.
    • For Linux: Follow the installation instructions for your distribution.
  • GPU Drivers (Optional but Recommended): If you plan to use GPU acceleration for faster inference, ensure that you have:

    • NVIDIA GPU drivers installed.
    • NVIDIA Container Toolkit installed to enable GPU support in Docker containers.

2. Choose an LLM Docker Image

There are several pre-built Docker images available for running LLMs locally. Some popular options include:

Option 1: Ollama

  • Ollama is a lightweight tool for running LLMs locally, and it provides Docker images for easy deployment.
  • Docker Hub image: ollama/ollama

Option 2: Text Generation Web UI

  • Text Generation Web UI is a popular web-based interface for interacting with LLMs. It supports models like Llama, GPT-J, and others.
  • Container image: ghcr.io/oobabooga/text-generation-webui (hosted on the GitHub Container Registry rather than Docker Hub)

Option 3: Hugging Face Transformers

  • You can also create a custom Docker image using the Hugging Face Transformers library to run any model available on Hugging Face.
  • Example: see section 5 below for a minimal custom image built with Transformers and Flask.

For this guide, we'll focus on using Ollama and Text Generation Web UI, as they are beginner-friendly and widely used.


3. Running Ollama via Docker

Step 1: Pull the Ollama Docker Image

First, pull the official Ollama Docker image from Docker Hub:

docker pull ollama/ollama

Step 2: Run the Ollama Container

Run the Ollama container with the following command:

docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

  • -d: Runs the container in the background (the image starts the Ollama server automatically).
  • --name ollama: Names the container so you can run commands in it later with docker exec.
  • -v ollama:/root/.ollama: Stores downloaded models in a named Docker volume so they persist across container restarts.
  • -p 11434:11434: Maps port 11434 on your host machine to port 11434 in the container (used for API access).

If you have an NVIDIA GPU and the NVIDIA Container Toolkit installed, add --gpus all to enable GPU acceleration.

Step 3: Pull a Model

Once the container is running, you can pull a model with the ollama CLI inside the container using docker exec. For example, to pull the Llama 2 model:

docker exec -it ollama ollama pull llama2

Step 4: Interact with the Model

You can interact with the model by running:

docker exec -it ollama ollama run llama2

This will open an interactive session where you can type prompts and receive responses from the model.

Step 5: Access the API (Optional)

If you want to interact with the model programmatically, you can use the Ollama API. By default, the API runs on http://localhost:11434. You can send requests using tools like curl or Python's requests library.

Example using curl:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What is the capital of France?"
}'
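
Example using Python's requests library (a minimal sketch, assuming the container from Step 2 is running and llama2 has been pulled; the API streams JSON lines by default, so the request sets "stream" to false to get a single JSON object back):

import requests

# Ask the local Ollama server for a single, non-streamed completion
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "What is the capital of France?",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])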

4. Running Text Generation Web UI via Docker

Step 1: Pull the Text Generation Web UI Docker Image

Pull the official Text Generation Web UI Docker image:

docker pull ghcr.io/oobabooga/text-generation-webui:latest

Step 2: Run the Container

Run the container with GPU support (if available). If you don't have a GPU, you can omit the --gpus flag.

docker run --gpus all -p 7860:7860 -v /path/to/models:/models ghcr.io/oobabooga/text-generation-webui:latest

  • --gpus all: Enables GPU support (requires NVIDIA drivers and the NVIDIA Container Toolkit).
  • -p 7860:7860: Maps port 7860 on your host machine to port 7860 in the container (used for the web interface).
  • -v /path/to/models:/models: Mounts a local directory (/path/to/models) for downloaded models; adjust the container-side path if the image expects models in a different location.

Step 3: Access the Web Interface

Once the container is running, open your browser and navigate to:

http://localhost:7860

This will bring up the Text Generation Web UI, where you can interact with the LLM.

Step 4: Download a Model

In the web interface:

  1. Go to the Model tab.
  2. In the download field, enter the Hugging Face name of a model (e.g., a Llama 2 or GPT-J repository; gated models such as Llama 2 require a Hugging Face access token).
  3. Click Download to fetch the model, then select it from the model dropdown and load it.

Step 5: Generate Text

After downloading the model, go to the Text Generation tab and start typing prompts. The model will generate responses based on your input.


5. Using Hugging Face Transformers via Docker

If you want to run a custom LLM using the Hugging Face Transformers library, you can create your own Docker image.

Step 1: Create a Dockerfile

Create a file named Dockerfile with the following content:

FROM python:3.9-slim

# Install dependencies
RUN pip install --no-cache-dir torch transformers flask

# Copy your app code
COPY app.py /app.py

# Expose the port
EXPOSE 5000

# Run the app
CMD ["python", "app.py"]

Step 2: Create a Flask App

Create a file named app.py with the following content:

from flask import Flask, request, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer

app = Flask(__name__)

# Load the model and tokenizer once at startup
# (this model is gated: accept Meta's license on Hugging Face and provide an access token)
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

@app.route('/generate', methods=['POST'])
def generate():
    data = request.json
    prompt = data.get('prompt', '')
    
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_new_tokens bounds the generated continuation rather than the total sequence length
    outputs = model.generate(**inputs, max_new_tokens=100)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return jsonify({'response': response})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Step 3: Build the Docker Image

Build the Docker image:

docker build -t my-llm-app .

Step 4: Run the Container

Run the container (omit --gpus all if you don't have a GPU). Mounting your local Hugging Face cache avoids re-downloading the weights on every run, and a gated model like Llama 2 also needs your Hugging Face access token (for example via the HF_TOKEN environment variable):

docker run --gpus all -p 5000:5000 -v ~/.cache/huggingface:/root/.cache/huggingface -e HF_TOKEN=<your_hf_token> my-llm-app

Step 5: Interact with the API

You can now send POST requests to the API:

curl -X POST http://localhost:5000/generate -H "Content-Type: application/json" -d '{"prompt": "What is the capital of France?"}'

6. Tips for Optimizing Performance

6.1. Use GPU Acceleration

Running LLMs on a CPU can be slow. If you have a GPU, ensure that you enable GPU support in Docker using the --gpus flag. This will significantly speed up inference.
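
If you want to confirm that a container actually sees the GPU, a quick check with PyTorch (assuming PyTorch is installed in the image, as in the custom image from section 5) looks like this:

import torch

# Reports whether the CUDA runtime and a GPU are visible inside the container
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))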

6.2. Quantization

To reduce memory usage and improve performance, consider using quantized versions of models (e.g., 4-bit or 8-bit precision). Tools like bitsandbytes can help with quantization.
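
For example, with the Transformers library you can load a model in 4-bit precision through bitsandbytes. This is a minimal sketch, assuming bitsandbytes and accelerate are installed in the image and a CUDA-capable GPU is available:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"  # gated; requires an approved Hugging Face token

# Quantize weights to 4-bit on load, cutting the weight memory footprint
# to roughly a quarter of fp16
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on the available GPU(s)
)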

6.3. Caching

Cache frequently accessed embeddings or responses to reduce latency, especially if you're running the model in a production environment.
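
One simple approach is an in-process cache keyed on the prompt, so identical requests skip generation entirely. A minimal sketch for the Flask app from section 5, reusing the model and tokenizer it already loads at startup:

from functools import lru_cache

@lru_cache(maxsize=256)  # keep the 256 most recently used prompt/response pairs in memory
def cached_generate(prompt: str) -> str:
    # model and tokenizer are the objects loaded at startup in app.py
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

The /generate route can then call cached_generate(prompt) instead of running the model directly.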

6.4. Resource Limits

If you're running multiple containers or services, set resource limits using Docker's --memory and --cpus flags to prevent one container from consuming all available resources.


7. Conclusion

Using Docker to run an LLM locally simplifies the deployment process and ensures consistency across different environments. Whether you choose Ollama, Text Generation Web UI, or a custom solution using Hugging Face Transformers, Docker makes it easy to manage dependencies and scale your application.

By following the steps outlined above, you can:

  • Pull and run pre-built LLM Docker images.
  • Interact with the model via CLI, web interface, or API.
  • Optimize performance using GPU acceleration and quantization.

With Docker, you can efficiently run and interact with LLMs on your local machine, making it a powerful tool for experimentation, development, and deployment.