# predict-otron-9000

A comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces.

Powerful local AI inference with OpenAI-compatible APIs

## Project Overview

The predict-otron-9000 is a flexible AI platform that provides:

- **Local LLM Inference**: Run Gemma and Llama models locally with CPU or GPU acceleration
- **Embeddings Generation**: Create text embeddings with FastEmbed
- **Web Interface**: Interact with models through a Leptos WASM chat interface
- **TypeScript CLI**: Command-line client for testing and automation
- **Production Deployment**: Docker and Kubernetes deployment options

The system supports both CPU and GPU acceleration (CUDA/Metal), with intelligent fallbacks and platform-specific optimizations.

## Features

- **OpenAI Compatible**: API endpoints match OpenAI's format for easy integration
- **Text Embeddings**: Generate high-quality text embeddings using FastEmbed
- **Text Generation**: Chat completions with an OpenAI-compatible API using Gemma and Llama models (various sizes, including instruction-tuned variants)
- **Performance Optimized**: Efficient caching and platform-specific optimizations for improved throughput
- **Web Chat Interface**: Leptos-based WebAssembly (WASM) chat interface for browser-based interaction
- **Flexible Deployment**: Run as a monolithic service or as a microservices architecture

## Architecture Overview

### Workspace Structure

The project uses a 7-crate Rust workspace plus TypeScript components:

```
crates/
├── predict-otron-9000/   # Main orchestration server (Rust 2024)
├── inference-engine/     # Multi-model inference orchestrator (Rust 2021)
├── gemma-runner/         # Gemma model inference via Candle (Rust 2021)
├── llama-runner/         # Llama model inference via Candle (Rust 2021)
├── embeddings-engine/    # FastEmbed embeddings service (Rust 2024)
├── leptos-app/           # WASM web frontend (Rust 2021)
├── helm-chart-tool/      # Kubernetes deployment tooling (Rust 2024)
└── scripts/
    └── cli.ts            # TypeScript/Bun CLI client
```

### Service Architecture

- **Main Server** (port 8080): Orchestrates inference and embeddings services
- **Embeddings Service** (port 8080): Standalone FastEmbed service with OpenAI API compatibility
- **Web Frontend** (port 8788): Leptos WASM chat interface served by Trunk
- **CLI Client**: TypeScript/Bun client for testing and automation

### Deployment Modes

The architecture supports multiple deployment patterns:

1. **Development Mode**: All services run in a single process for simplified development
2. **Docker Monolithic**: A single containerized service handles all functionality
3. **Kubernetes Microservices**: Separate services for horizontal scalability and fault isolation
## Build and Configuration

### Dependencies and Environment Prerequisites

#### Rust Toolchain
- **Editions**: Mixed; main services use Rust 2024, some components use Rust 2021
- **Recommended**: Latest stable Rust toolchain: `rustup default stable && rustup update`
- **Developer tools**:
  - `rustup component add rustfmt` (formatting)
  - `rustup component add clippy` (linting)

#### Node.js/Bun Toolchain
- **Bun**: Required for the TypeScript CLI client: `curl -fsSL https://bun.sh/install | bash`
- **Node.js**: Alternative to Bun; supports OpenAI SDK v5.16.0+

#### WASM Frontend Toolchain
- **Trunk**: Required for Leptos frontend builds: `cargo install trunk`
- **wasm-pack**: `cargo install wasm-pack`
- **WASM target**: `rustup target add wasm32-unknown-unknown`

#### ML Framework Dependencies
- **Candle**: Version 0.9.1 with conditional compilation:
  - macOS: Metal support with CPU fallback for stability
  - Linux: CUDA support with CPU fallback
  - CPU-only: Supported on all platforms
- **FastEmbed**: Version 4.x for embeddings functionality

#### Hugging Face Access
- **Required for**: Gemma model downloads (gated models)
- **Authentication**:
  - CLI: `pip install -U "huggingface_hub[cli]" && huggingface-cli login`
  - Environment: `export HF_TOKEN=""`
- **Cache management**: `export HF_HOME="$PWD/.hf-cache"` (optional; keeps the cache local)
- **Model access**: Accept the Gemma model licenses on Hugging Face before use

#### Platform-Specific Notes
- **macOS**: Metal acceleration is available but routed to CPU for Gemma v3 stability
- **Linux**: CUDA support with BF16 precision on GPU, F32 on CPU
- **Conditional compilation**: Handled automatically per platform in Cargo.toml

### Build Procedures

#### Full Workspace Build
```bash
cargo build --workspace --release
```

#### Individual Services

**Main Server:**
```bash
cargo build --bin predict-otron-9000 --release
```

**Inference Engine CLI:**
```bash
cargo build --bin cli --package inference-engine --release
```

**Embeddings Service:**
```bash
cargo build --bin embeddings-engine --release
```

**Web Frontend:**
```bash
cd crates/leptos-app
trunk build --release
```

### Running Services

#### Main Server (Port 8080)
```bash
./scripts/run_server.sh
```
- Respects `SERVER_PORT` (default: 8080) and `RUST_LOG` (default: info)
- Boots with the default model: `gemma-3-1b-it`
- Requires HF authentication for the first-time model download

#### Web Frontend (Port 8788)
```bash
cd crates/leptos-app
./run.sh
```
- Serves the Leptos WASM frontend on port 8788
- Sets the RUSTFLAGS required for WebAssembly getrandom support
- Auto-reloads during development

#### TypeScript CLI Client
```bash
# List available models
bun run scripts/cli.ts --list-models

# Chat completion
bun run scripts/cli.ts "What is the capital of France?"

# With a specific model
bun run scripts/cli.ts --model gemma-3-1b-it --prompt "Hello, world!"

# Show help
bun run scripts/cli.ts --help
```
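Because the server exposes OpenAI-compatible endpoints, the official OpenAI SDK can also be pointed at it directly, as an alternative to the bundled CLI. The sketch below is a minimal example; it assumes the `openai` package is installed (for example via `bun add openai`) and that the local server accepts a placeholder API key, which is an assumption rather than documented behavior.

```typescript
import OpenAI from "openai";

// Point the SDK at the local predict-otron-9000 server instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local-placeholder", // assumption: the local server does not validate keys
});

const completion = await client.chat.completions.create({
  model: "default", // the configured model; see "Model Specification" below
  messages: [{ role: "user", content: "Say hello" }],
  max_tokens: 64,
});

console.log(completion.choices[0]?.message?.content);
```

Save this as a `.ts` file and run it with Bun; the same client setup also works under Node.js with the OpenAI SDK version noted above.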
## API Usage

### Health Checks and Model Inventory
```bash
curl -s http://localhost:8080/v1/models | jq
```

### Chat Completions

**Non-streaming:**
```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 64
  }' | jq
```

**Streaming (Server-Sent Events):**
```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Tell a short joke"}],
    "stream": true,
    "max_tokens": 64
  }'
```

**Model Specification:**
- Use `"model": "default"` for the configured model
- Or specify the exact model ID: `"model": "gemma-3-1b-it"`
- Requests with unknown models will be rejected

### Embeddings API

Generate text embeddings compatible with OpenAI's embeddings API.

**Endpoint**: `POST /v1/embeddings`

**Request Body**:
```json
{
  "input": "Your text to embed",
  "model": "nomic-embed-text-v1.5"
}
```

**Response**:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.1, 0.2, 0.3]
    }
  ],
  "model": "nomic-embed-text-v1.5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
```
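As with chat completions, the embeddings endpoint can be called from TypeScript with the OpenAI SDK. A minimal sketch, again assuming the `openai` package is installed and that a placeholder API key is accepted by the local server:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local-placeholder", // assumption: not validated by the local server
});

// Request an embedding from the local FastEmbed-backed service.
const response = await client.embeddings.create({
  model: "nomic-embed-text-v1.5",
  input: "Your text to embed",
});

const vector = response.data[0].embedding;
console.log(`embedding length: ${vector.length}`);
```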
### Web Frontend
- Navigate to `http://localhost:8788`
- Real-time chat interface with the inference server
- Supports streaming responses and conversation history

## Testing

### Test Categories

1. **Offline/fast tests**: No network or model downloads required
2. **Online tests**: Require HF authentication and model downloads
3. **Integration tests**: Multi-service end-to-end testing

### Quick Start: Offline Tests

**Prompt formatting tests:**
```bash
cargo test --workspace build_gemma_prompt
```

**Model metadata tests:**
```bash
cargo test --workspace which_
```

These verify core functionality without requiring HF access.

### Full Test Suite (Requires HF)

**Prerequisites:**
1. Accept the Gemma model licenses on Hugging Face
2. Authenticate: `huggingface-cli login` or `export HF_TOKEN=...`
3. Optional: `export HF_HOME="$PWD/.hf-cache"`

**Run all tests:**
```bash
cargo test --workspace
```

### Integration Testing

**End-to-end test script:**
```bash
./test.sh
```

This script:
- Starts the server in the background with proper cleanup
- Waits for server readiness via health checks
- Runs CLI tests for model listing and chat completion
- Includes a 60-second timeout and process management

## Development

### Code Style and Tooling

**Formatting:**
```bash
cargo fmt --all
```

**Linting:**
```bash
cargo clippy --workspace --all-targets -- -D warnings
```

**Logging:**
- The server uses the `tracing` framework
- Control via `RUST_LOG` (e.g., `RUST_LOG=debug ./scripts/run_server.sh`)

### Adding Tests

**For fast, offline tests:**
- Exercise pure logic without tokenizers/models
- Use descriptive names for easy filtering: `cargo test specific_test_name`
- Example patterns: prompt construction, metadata selection, tensor math

**Process:**
1. Add the test to an existing module
2. Run filtered: `cargo test --workspace new_test_name`
3. Verify in the full suite: `cargo test --workspace`

### OpenAI API Compatibility

**Features:**
- POST `/v1/chat/completions` with streaming and non-streaming responses
- Single configured model enforcement (use `"model": "default"`)
- Gemma-style prompt formatting with `<start_of_turn>`/`<end_of_turn>` markers
- System prompt injection into the first user turn
- Repetition detection and early stopping in streaming mode

**CORS:**
- Fully open by default (`tower-http CorsLayer::Any`)
- Adjust for production deployment

### Architecture Details

**Device Selection:**
- Automatic device/dtype selection
- CPU: Universal fallback (F32 precision)
- CUDA: BF16 precision on compatible GPUs
- Metal: Available but routed to CPU for Gemma v3 stability

**Model Loading:**
- Single-file `model.safetensors` preferred
- Falls back to index resolution via `utilities_lib::hub_load_safetensors`
- HF cache populated on first access

**Multi-Service Design:**
- Main server orchestrates inference and embeddings
- Services can run independently for horizontal scaling
- Docker/Kubernetes metadata included for deployment

## Deployment

### Docker Support

All services include Docker metadata in `Cargo.toml`:

**Main Server:**
- Image: `ghcr.io/geoffsee/predict-otron-9000:latest`
- Port: 8080

**Inference Service:**
- Image: `ghcr.io/geoffsee/inference-service:latest`
- Port: 8080

**Embeddings Service:**
- Image: `ghcr.io/geoffsee/embeddings-service:latest`
- Port: 8080

**Web Frontend:**
- Image: `ghcr.io/geoffsee/leptos-app:latest`
- Port: 8788

**Docker Compose:**
```bash
# Start all services
docker-compose up -d

# Check logs
docker-compose logs -f

# Stop services
docker-compose down
```

### Kubernetes Support

All services include Kubernetes manifest metadata:
- Single-replica deployments by default
- Service-specific port configurations
- Ready for horizontal pod autoscaling

For Kubernetes deployment details, see the [ARCHITECTURE.md](docs/ARCHITECTURE.md) document.

### Build Artifacts

**Ignored by Git:**
- `target/` (Rust build artifacts)
- `node_modules/` (Node.js dependencies)
- `dist/` (frontend build output)
- `.fastembed_cache/` (FastEmbed model cache)
- `.hf-cache/` (Hugging Face cache, if configured)

## Common Issues and Solutions

### Authentication/Licensing

**Symptom:** 404 or permission errors when fetching models

**Solution:**
1. Accept the Gemma model licenses on Hugging Face
2. Authenticate with `huggingface-cli login` or `HF_TOKEN`
3. Verify the token with `huggingface-cli whoami`

### GPU Issues

**Symptom:** OOM errors or GPU panics

**Solution:**
1. Test on CPU first: set `CUDA_VISIBLE_DEVICES=""` if needed
2. Check available VRAM against model requirements
3. Consider using smaller model variants

### Model Mismatch Errors

**Symptom:** 400 errors with `type=model_mismatch`

**Solution:**
- Use `"model": "default"` in API requests
- Or match the configured model ID exactly: `"model": "gemma-3-1b-it"`
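A client can also avoid these errors programmatically by discovering the served model from `/v1/models` before sending requests. A minimal TypeScript sketch using the OpenAI SDK; the `openai` package, the placeholder API key, and picking the first listed model are all assumptions for illustration:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local-placeholder", // assumption: not validated by the local server
});

// List the models the server actually serves and use the first one,
// falling back to the "default" alias if the list is empty.
const models = await client.models.list();
const modelId = models.data[0]?.id ?? "default";

const completion = await client.chat.completions.create({
  model: modelId,
  messages: [{ role: "user", content: "Which model are you?" }],
  max_tokens: 32,
});

console.log(`served by ${modelId}:`, completion.choices[0]?.message?.content);
```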
### Frontend Build Issues

**Symptom:** WASM compilation failures

**Solution:**
1. Install the required target: `rustup target add wasm32-unknown-unknown`
2. Install Trunk: `cargo install trunk`
3. Check RUSTFLAGS in leptos-app/run.sh

### Network/Timeout Issues

**Symptom:** First-time model downloads time out

**Solution:**
1. Ensure a stable internet connection
2. Consider using a local HF cache: `export HF_HOME="$PWD/.hf-cache"`
3. Download models manually with `huggingface-cli`

## Minimal End-to-End Verification

**Build verification:**
```bash
cargo build --workspace --release
```

**Fast offline tests:**
```bash
cargo test --workspace build_gemma_prompt
cargo test --workspace which_
```

**Service startup:**
```bash
./scripts/run_server.sh &
sleep 10  # Wait for server startup
curl -s http://localhost:8080/v1/models | jq
```

**CLI client test:**
```bash
bun run scripts/cli.ts "What is 2+2?"
```

**Web frontend:**
```bash
cd crates/leptos-app && ./run.sh &
# Navigate to http://localhost:8788
```

**Integration test:**
```bash
./test.sh
```

**Cleanup:**
```bash
pkill -f "predict-otron-9000"
pkill -f "trunk"
```

For networked tests and full functionality, ensure Hugging Face authentication is configured as described above.

## Further Reading

### Documentation

- [Architecture](docs/ARCHITECTURE.md) - Detailed architectural diagrams and deployment patterns
- [Server Configuration Guide](docs/SERVER_CONFIG.md) - Detailed server configuration options
- [Testing Documentation](docs/TESTING.md) - Comprehensive testing guide
- [Performance Benchmarking](docs/BENCHMARKING.md) - Instructions for benchmarking

## Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and add tests
4. Ensure all tests pass: `cargo test`
5. Submit a pull request

_Warning: Do NOT use this in production unless you are cool like that._