predict-otron-9000

A comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces.

Powerful local AI inference with OpenAI-compatible APIs

Project Overview

The predict-otron-9000 is a flexible AI platform that provides:

  • Local LLM Inference: Run Gemma and Llama models locally with CPU or GPU acceleration
  • Embeddings Generation: Create text embeddings with FastEmbed
  • Web Interface: Interact with models through a Leptos WASM chat interface
  • TypeScript CLI: Command-line client for testing and automation
  • Production Deployment: Docker and Kubernetes deployment options

The system supports both CPU and GPU acceleration (CUDA/Metal), with intelligent fallbacks and platform-specific optimizations.

Features

  • OpenAI Compatible: API endpoints match OpenAI's format for easy integration
  • Text Embeddings: Generate high-quality text embeddings using FastEmbed
  • Text Generation: Chat completions with OpenAI-compatible API using Gemma and Llama models (various sizes including instruction-tuned variants)
  • Performance Optimized: Efficient caching and platform-specific optimizations for improved throughput
  • Web Chat Interface: Leptos-based WebAssembly (WASM) chat interface for browser-based interaction
  • Flexible Deployment: Run as monolithic service or microservices architecture

Architecture Overview

Workspace Structure

The project uses a 7-crate Rust workspace plus TypeScript components:

crates/
├── predict-otron-9000/     # Main orchestration server (Rust 2024)
├── inference-engine/       # Multi-model inference orchestrator (Rust 2021)
├── gemma-runner/           # Gemma model inference via Candle (Rust 2021)
├── llama-runner/           # Llama model inference via Candle (Rust 2021)
├── embeddings-engine/      # FastEmbed embeddings service (Rust 2024)
├── leptos-app/             # WASM web frontend (Rust 2021)
└── helm-chart-tool/        # Kubernetes deployment tooling (Rust 2024)
scripts/
└── cli.ts                  # TypeScript/Bun CLI client

Service Architecture

  • Main Server (port 8080): Orchestrates inference and embeddings services
  • Embeddings Service (port 8080 when run standalone): FastEmbed service with OpenAI API compatibility
  • Web Frontend (port 8788): Leptos WASM chat interface served by Trunk
  • CLI Client: TypeScript/Bun client for testing and automation

Deployment Modes

The architecture supports multiple deployment patterns:

  1. Development Mode: All services run in a single process for simplified development
  2. Docker Monolithic: Single containerized service handling all functionality
  3. Kubernetes Microservices: Separate services for horizontal scalability and fault isolation

Build and Configuration

Dependencies and Environment Prerequisites

Rust Toolchain

  • Editions: Mixed: predict-otron-9000, embeddings-engine, and helm-chart-tool use Rust 2024; the remaining crates use Rust 2021
  • Recommended: Latest stable Rust toolchain: rustup default stable && rustup update
  • Developer tools:
    • rustup component add rustfmt (formatting)
    • rustup component add clippy (linting)

Node.js/Bun Toolchain

  • Bun: Required for TypeScript CLI client: curl -fsSL https://bun.sh/install | bash
  • Node.js: Works as an alternative to Bun; the CLI targets OpenAI SDK v5.16.0+

WASM Frontend Toolchain

  • Trunk: Required for Leptos frontend builds: cargo install trunk
  • wasm-pack: cargo install wasm-pack
  • WASM target: rustup target add wasm32-unknown-unknown

ML Framework Dependencies

  • Candle: Version 0.9.1 with conditional compilation:
    • macOS: Metal support with CPU fallback for stability
    • Linux: CUDA support with CPU fallback
    • CPU-only: Supported on all platforms
  • FastEmbed: Version 4.x for embeddings functionality

Hugging Face Access

  • Required for: Gemma model downloads (gated models)
  • Authentication:
    • CLI: pip install -U "huggingface_hub[cli]" && huggingface-cli login
    • Environment: export HF_TOKEN="<your_token>"
  • Cache management: export HF_HOME="$PWD/.hf-cache" (optional, keeps cache local)
  • Model access: Accept Gemma model licenses on Hugging Face before use

Platform-Specific Notes

  • macOS: Metal acceleration available but routed to CPU for Gemma v3 stability
  • Linux: CUDA support with BF16 precision on GPU, F32 on CPU
  • Conditional compilation: Handled automatically per platform in Cargo.toml

Build Procedures

Full Workspace Build

cargo build --workspace --release

Individual Services

Main Server:

cargo build --bin predict-otron-9000 --release

Inference Engine CLI:

cargo build --bin cli --package inference-engine --release

Embeddings Service:

cargo build --bin embeddings-engine --release

Web Frontend:

cd crates/leptos-app
trunk build --release

Running Services

Main Server (Port 8080)

./scripts/run_server.sh
  • Respects SERVER_PORT (default: 8080) and RUST_LOG (default: info)
  • Boots with default model: gemma-3-1b-it
  • Requires HF authentication for first-time model download

Web Frontend (Port 8788)

cd crates/leptos-app
./run.sh
  • Serves Leptos WASM frontend on port 8788
  • Sets required RUSTFLAGS for WebAssembly getrandom support
  • Auto-reloads during development

TypeScript CLI Client

# List available models
bun run scripts/cli.ts --list-models

# Chat completion
bun run scripts/cli.ts "What is the capital of France?"

# With specific model
bun run scripts/cli.ts --model gemma-3-1b-it --prompt "Hello, world!"

# Show help
bun run scripts/cli.ts --help

API Usage

Health Checks and Model Inventory

curl -s http://localhost:8080/v1/models | jq
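
The same inventory can be queried programmatically. Below is a minimal TypeScript sketch using the official openai package (installed separately, e.g. bun add openai); the apiKey value is a placeholder, since nothing here states that the server validates keys:

import OpenAI from "openai";

// Point the SDK at the local server instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local", // placeholder; key validation is not documented
});

// Print the IDs of the models the server reports as available.
const models = await client.models.list();
for (const model of models.data) {
  console.log(model.id);
}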

Chat Completions

Non-streaming:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 64
  }' | jq
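
The same non-streaming request through the OpenAI TypeScript SDK, as a sketch (client setup as in the model-listing example above):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local", // placeholder; key validation is not documented
});

// Non-streaming chat completion against the local server.
const completion = await client.chat.completions.create({
  model: "default",
  messages: [{ role: "user", content: "Say hello" }],
  max_tokens: 64,
});

console.log(completion.choices[0].message.content);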

Streaming (Server-Sent Events):

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default", 
    "messages": [{"role": "user", "content": "Tell a short joke"}],
    "stream": true,
    "max_tokens": 64
  }'
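
Consuming the same SSE stream from TypeScript with the OpenAI SDK, as a sketch (chunks follow the OpenAI delta format, so content may be absent on some chunks):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local", // placeholder; key validation is not documented
});

// Stream tokens as they arrive and print them incrementally.
const stream = await client.chat.completions.create({
  model: "default",
  messages: [{ role: "user", content: "Tell a short joke" }],
  stream: true,
  max_tokens: 64,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
process.stdout.write("\n");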

Model Specification:

  • Use "model": "default" for configured model
  • Or specify exact model ID: "model": "gemma-3-1b-it"
  • Requests with unknown models will be rejected

Embeddings API

Generate text embeddings compatible with OpenAI's embeddings API.

Endpoint: POST /v1/embeddings

Request Body:

{
  "input": "Your text to embed",
  "model": "nomic-embed-text-v1.5"
}

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.1, 0.2, 0.3]
    }
  ],
  "model": "nomic-embed-text-v1.5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
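
Calling the endpoint from TypeScript via the OpenAI SDK, as a sketch (the embedding length depends on the embedding model the service is configured with):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local", // placeholder; key validation is not documented
});

// Request an embedding for a single input string.
const response = await client.embeddings.create({
  model: "nomic-embed-text-v1.5",
  input: "Your text to embed",
});

const vector = response.data[0].embedding;
console.log(`dimensions: ${vector.length}`);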

Web Frontend

  • Navigate to http://localhost:8788
  • Real-time chat interface with the inference server
  • Supports streaming responses and conversation history

Testing

Test Categories

  1. Offline/fast tests: No network or model downloads required
  2. Online tests: Require HF authentication and model downloads
  3. Integration tests: Multi-service end-to-end testing

Quick Start: Offline Tests

Prompt formatting tests:

cargo test --workspace build_gemma_prompt

Model metadata tests:

cargo test --workspace which_

These verify core functionality without requiring HF access.

Full Test Suite (Requires HF)

Prerequisites:

  1. Accept Gemma model licenses on Hugging Face
  2. Authenticate: huggingface-cli login or export HF_TOKEN=...
  3. Optional: export HF_HOME="$PWD/.hf-cache"

Run all tests:

cargo test --workspace

Integration Testing

End-to-end test script:

./test.sh

This script:

  • Starts the server in background with proper cleanup
  • Waits for server readiness via health checks
  • Runs CLI tests for model listing and chat completion
  • Includes 60-second timeout and process management
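
For reference, a similar readiness check can be expressed as a Bun test. The sketch below is hypothetical (not a file in the repository) and assumes the server is already running on port 8080 and that /v1/models returns an OpenAI-style list object:

import { expect, test } from "bun:test";

const BASE_URL = process.env.BASE_URL ?? "http://localhost:8080";

test("server reports at least one model", async () => {
  const res = await fetch(`${BASE_URL}/v1/models`);
  expect(res.status).toBe(200);

  // Expect an OpenAI-style list: { "object": "list", "data": [...] }
  const body = await res.json();
  expect(body.object).toBe("list");
  expect(body.data.length).toBeGreaterThan(0);
});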

Development

Code Style and Tooling

Formatting:

cargo fmt --all

Linting:

cargo clippy --workspace --all-targets -- -D warnings

Logging:

  • Server uses tracing framework
  • Control via RUST_LOG (e.g., RUST_LOG=debug ./scripts/run_server.sh)

Adding Tests

For fast, offline tests:

  • Exercise pure logic without tokenizers/models
  • Use descriptive names for easy filtering: cargo test specific_test_name
  • Example patterns: prompt construction, metadata selection, tensor math

Process:

  1. Add test to existing module
  2. Run filtered: cargo test --workspace new_test_name
  3. Verify in full suite: cargo test --workspace

OpenAI API Compatibility

Features:

  • POST /v1/chat/completions with streaming and non-streaming
  • Single configured model enforcement (use "model": "default")
  • Gemma-style prompt formatting with <start_of_turn>/<end_of_turn> markers (illustrated after this list)
  • System prompt injection into first user turn
  • Repetition detection and early stopping in streaming mode
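
For orientation, the sketch below illustrates the general shape of a Gemma-style prompt in TypeScript. It is illustrative only; the server's actual prompt builder is implemented in Rust (see the build_gemma_prompt tests) and may differ in details:

// Illustrative only: approximate Gemma turn markup for a single exchange.
// The system prompt, if any, is folded into the first user turn.
function buildGemmaPrompt(system: string | null, userMessage: string): string {
  const firstTurn = system ? `${system}\n\n${userMessage}` : userMessage;
  return (
    `<start_of_turn>user\n${firstTurn}<end_of_turn>\n` +
    `<start_of_turn>model\n`
  );
}

console.log(buildGemmaPrompt("Be concise.", "Say hello"));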

CORS:

  • Fully open by default (permissive tower-http CorsLayer allowing any origin)
  • Adjust for production deployment

Architecture Details

Device Selection:

  • Automatic device/dtype selection
  • CPU: Universal fallback (F32 precision)
  • CUDA: BF16 precision on compatible GPUs
  • Metal: Available but routed to CPU for Gemma v3 stability

Model Loading:

  • Single-file model.safetensors preferred
  • Falls back to index resolution via utilities_lib::hub_load_safetensors
  • HF cache populated on first access

Multi-Service Design:

  • Main server orchestrates inference and embeddings
  • Services can run independently for horizontal scaling
  • Docker/Kubernetes metadata included for deployment

Deployment

Docker Support

All services include Docker metadata in Cargo.toml:

Main Server:

  • Image: ghcr.io/geoffsee/predict-otron-9000:latest
  • Port: 8080

Inference Service:

  • Image: ghcr.io/geoffsee/inference-service:latest
  • Port: 8080

Embeddings Service:

  • Image: ghcr.io/geoffsee/embeddings-service:latest
  • Port: 8080

Web Frontend:

  • Image: ghcr.io/geoffsee/leptos-app:latest
  • Port: 8788

Docker Compose:

# Start all services
docker-compose up -d

# Check logs
docker-compose logs -f

# Stop services
docker-compose down

Kubernetes Support

All services include Kubernetes manifest metadata:

  • Single replica deployments by default
  • Service-specific port configurations
  • Ready for horizontal pod autoscaling

For Kubernetes deployment details, see the ARCHITECTURE.md document.

Build Artifacts

Ignored by Git:

  • target/ (Rust build artifacts)
  • node_modules/ (Node.js dependencies)
  • dist/ (Frontend build output)
  • .fastembed_cache/ (FastEmbed model cache)
  • .hf-cache/ (Hugging Face cache, if configured)

Common Issues and Solutions

Authentication/Licensing

Symptom: 404 or permission errors fetching models
Solution:

  1. Accept Gemma model licenses on Hugging Face
  2. Authenticate with huggingface-cli login or HF_TOKEN
  3. Verify token with huggingface-cli whoami

GPU Issues

Symptom: OOM errors or GPU panics
Solution:

  1. Test on CPU first: set CUDA_VISIBLE_DEVICES="" to force CPU execution if needed
  2. Check available VRAM vs model requirements
  3. Consider using smaller model variants

Model Mismatch Errors

Symptom: 400 errors with type=model_mismatch
Solution:

  • Use "model": "default" in API requests
  • Or match configured model ID exactly: "model": "gemma-3-1b-it"

Frontend Build Issues

Symptom: WASM compilation failures
Solution:

  1. Install required targets: rustup target add wasm32-unknown-unknown
  2. Install trunk: cargo install trunk
  3. Check RUSTFLAGS in leptos-app/run.sh

Network/Timeout Issues

Symptom: First-time model downloads timing out
Solution:

  1. Ensure stable internet connection
  2. Consider using local HF cache: export HF_HOME="$PWD/.hf-cache"
  3. Download models manually with huggingface-cli

Minimal End-to-End Verification

Build verification:

cargo build --workspace --release

Fast offline tests:

cargo test --workspace build_gemma_prompt
cargo test --workspace which_

Service startup:

./scripts/run_server.sh &
sleep 10  # Wait for server startup
curl -s http://localhost:8080/v1/models | jq

CLI client test:

bun run scripts/cli.ts "What is 2+2?"

Web frontend:

cd crates/leptos-app && ./run.sh &
# Navigate to http://localhost:8788

Integration test:

./test.sh

Cleanup:

pkill -f "predict-otron-9000"
pkill -f "trunk"

For networked tests and full functionality, ensure Hugging Face authentication is configured as described above.

Further Reading

Documentation

  • ARCHITECTURE.md: service architecture and Kubernetes deployment details

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Ensure all tests pass: cargo test
  5. Submit a pull request

Warning: Do NOT use this in production unless you are cool like that.
