From 62dcc8f5bba7bcbebe386a45ad03748f3f80eef0 Mon Sep 17 00:00:00 2001
From: geoffsee <>
Date: Thu, 28 Aug 2025 16:04:38 -0400
Subject: [PATCH] ai generated README.md

---
 README.md | 622 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 394 insertions(+), 228 deletions(-)

diff --git a/README.md b/README.md
index 627c6bb..36aa6ac 100644
--- a/README.md
+++ b/README.md
@@ -1,114 +1,204 @@
 # predict-otron-9000
-_Warning: Do NOT use this in production unless you are cool like that._
+A comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces.

-Aliens, in a native executable.
+Powerful local AI inference with OpenAI-compatible APIs

+## Project Overview
+
+The predict-otron-9000 is a flexible AI platform that provides:
+
+- **Local LLM Inference**: Run Gemma models locally with CPU or GPU acceleration
+- **Embeddings Generation**: Create text embeddings with FastEmbed
+- **Web Interface**: Interact with models through a Leptos WASM chat interface
+- **TypeScript CLI**: Command-line client for testing and automation
+- **Production Deployment**: Docker and Kubernetes deployment options
+
+The system supports both CPU and GPU acceleration (CUDA/Metal), with intelligent fallbacks and platform-specific optimizations.

 ## Features
+
 - **OpenAI Compatible**: API endpoints match OpenAI's format for easy integration
-- **Text Embeddings**: Generate high-quality text embeddings using the Nomic Embed Text v1.5 model
-- **Text Generation**: Chat completions with OpenAI-compatible API using Gemma models (1B, 2B, 7B, 9B variants including base and instruction-tuned models)
-- **Performance Optimized**: Implements efficient caching and singleton patterns for improved throughput and reduced latency
-- **Performance Benchmarking**: Includes tools for measuring performance and generating HTML reports
-- **Web Chat Interface**: A Leptos-based WebAssembly (WASM) chat interface for browser-based interaction with the inference engine
+- **Text Embeddings**: Generate high-quality text embeddings using FastEmbed
+- **Text Generation**: Chat completions with OpenAI-compatible API using Gemma models (1B, 2B, 7B variants including instruction-tuned models)
+- **Performance Optimized**: Efficient caching and platform-specific optimizations for improved throughput
+- **Web Chat Interface**: Leptos-based WebAssembly (WASM) chat interface for browser-based interaction
+- **Flexible Deployment**: Run as monolithic service or microservices architecture

-## Architecture
+## Architecture Overview

-### Core Components
+### Workspace Structure

-- **`predict-otron-9000`**: Main unified server that combines both engines
-- **`embeddings-engine`**: Handles text embeddings using FastEmbed with the Nomic Embed Text v1.5 model
-- **`inference-engine`**: Provides text generation capabilities using Gemma models (1B, 2B, 7B, 9B variants) via Candle transformers
-- **`leptos-app`**: WebAssembly-based chat interface built with Leptos framework for browser-based interaction with the inference engine
+The project uses a 4-crate Rust workspace plus TypeScript components:

-## Further Reading
-
-### Documentation
-
-- [Architecture](docs/ARCHITECTURE.md) - Detailed server configuration options and deployment modes
-- [Server Configuration Guide](docs/SERVER_CONFIG.md) - Detailed server configuration options and deployment modes
-- [Testing Documentation](docs/TESTING.md) - Comprehensive testing guide including unit, integration and e2e tests
-- [Performance Benchmarking](docs/BENCHMARKING.md) - Instructions for running and analyzing performance benchmarks
-
-## Installation
-
-### Prerequisites
-
-- Rust 1.70+ with 2024 edition support
-- Cargo package manager
-
-### Build from Source
-```shell
-# 1. Clone the repository
-git clone
-cd predict-otron-9000
-
-# 2. Build the project
-cargo build --release
-
-# 3. Run the unified server
-./run_server.sh
-
-# Alternative: Build and run individual components
-# For inference engine only:
-cargo run -p inference-engine --release -- --server --port 3777
-# For embeddings engine only:
-cargo run -p embeddings-engine --release
+```
+crates/
+├── predict-otron-9000/    # Main orchestration server (Rust 2024)
+├── inference-engine/      # Gemma inference via Candle (Rust 2021)
+├── embeddings-engine/     # FastEmbed embeddings service (Rust 2024)
+└── leptos-app/            # WASM web frontend (Rust 2021)
+cli.ts                     # TypeScript/Bun CLI client
 ```

-## Usage
+### Service Architecture

-### Starting the Server
+- **Main Server** (port 8080): Orchestrates inference and embeddings services
+- **Embeddings Service** (port 8080): Standalone FastEmbed service with OpenAI API compatibility
+- **Web Frontend** (port 8788): Leptos WASM chat interface served by Trunk
+- **CLI Client**: TypeScript/Bun client for testing and automation

-The server can be started using the provided script or directly with cargo:
+### Deployment Modes

-```shell
-# Using the provided script
-./run_server.sh
+The architecture supports multiple deployment patterns:

-# Or directly with cargo
-cargo run --bin predict-otron-9000
+1. **Development Mode**: All services run in a single process for simplified development
+2. **Docker Monolithic**: Single containerized service handling all functionality
+3. **Kubernetes Microservices**: Separate services for horizontal scalability and fault isolation
+
+## Build and Configuration
+
+### Dependencies and Environment Prerequisites
+
+#### Rust Toolchain
+- **Editions**: Mixed - main services use Rust 2024, some components use 2021
+- **Recommended**: Latest stable Rust toolchain: `rustup default stable && rustup update`
+- **Developer tools**:
+  - `rustup component add rustfmt` (formatting)
+  - `rustup component add clippy` (linting)
+
+#### Node.js/Bun Toolchain
+- **Bun**: Required for TypeScript CLI client: `curl -fsSL https://bun.sh/install | bash`
+- **Node.js**: Alternative to Bun, supports OpenAI SDK v5.16.0+
+
+#### WASM Frontend Toolchain
+- **Trunk**: Required for Leptos frontend builds: `cargo install trunk`
+- **wasm-pack**: `cargo install wasm-pack`
+- **WASM target**: `rustup target add wasm32-unknown-unknown`
+
+#### ML Framework Dependencies
+- **Candle**: Version 0.9.1 with conditional compilation:
+  - macOS: Metal support with CPU fallback for stability
+  - Linux: CUDA support with CPU fallback
+  - CPU-only: Supported on all platforms
+- **FastEmbed**: Version 4.x for embeddings functionality
+
+#### Hugging Face Access
+- **Required for**: Gemma model downloads (gated models)
+- **Authentication**:
+  - CLI: `pip install -U "huggingface_hub[cli]" && huggingface-cli login`
+  - Environment: `export HF_TOKEN=""`
+- **Cache management**: `export HF_HOME="$PWD/.hf-cache"` (optional, keeps cache local)
+- **Model access**: Accept Gemma model licenses on Hugging Face before use
+
+#### Platform-Specific Notes
+- **macOS**: Metal acceleration available but routed to CPU for Gemma v3 stability
+- **Linux**: CUDA support with BF16 precision on GPU, F32 on CPU
+- **Conditional compilation**: Handled automatically per platform in Cargo.toml
+
+### Build Procedures
+
+#### Full Workspace Build
+```bash
+cargo build --workspace --release
 ```

-### Configuration
+#### Individual Services

-Environment variables for server configuration:
-
-- `SERVER_HOST`: Server bind address (default: `0.0.0.0`)
-- `SERVER_PORT`: Server port (default: `8080`)
-- `SERVER_CONFIG`: JSON configuration for deployment mode (default: Local mode)
-- `RUST_LOG`: Logging level configuration
-
-#### Deployment Modes
-
-The server supports two deployment modes controlled by `SERVER_CONFIG`:
-
-**Local Mode (default)**: Runs inference and embeddings services locally
-```shell
-./run_server.sh
+**Main Server:**
+```bash
+cargo build --bin predict-otron-9000 --release
 ```

-**HighAvailability Mode**: Proxies requests to external services
-```shell
-export SERVER_CONFIG='{"serverMode": "HighAvailability"}'
-./run_server.sh
+**Inference Engine CLI:**
+```bash
+cargo build --bin cli --package inference-engine --release
 ```

-See [docs/SERVER_CONFIG.md](docs/SERVER_CONFIG.md) for complete configuration options, Docker Compose, and Kubernetes examples.
-
-#### Basic Configuration Example:
-```shell
-export SERVER_PORT=3000
-export RUST_LOG=debug
-./run_server.sh
+**Embeddings Service:**
+```bash
+cargo build --bin embeddings-engine --release
 ```

-## API Endpoints
+**Web Frontend:**
+```bash
+cd crates/leptos-app
+trunk build --release
+```

-### Text Embeddings
+### Running Services
+
+#### Main Server (Port 8080)
+```bash
+./scripts/run_server.sh
+```
+- Respects `SERVER_PORT` (default: 8080) and `RUST_LOG` (default: info)
+- Boots with default model: `gemma-3-1b-it`
+- Requires HF authentication for first-time model download
+
+#### Web Frontend (Port 8788)
+```bash
+cd crates/leptos-app
+./run.sh
+```
+- Serves Leptos WASM frontend on port 8788
+- Sets required RUSTFLAGS for WebAssembly getrandom support
+- Auto-reloads during development
+
+#### TypeScript CLI Client
+```bash
+# List available models
+bun run cli.ts --list-models
+
+# Chat completion
+bun run cli.ts "What is the capital of France?"
+
+# With specific model
+bun run cli.ts --model gemma-3-1b-it --prompt "Hello, world!"
+
+# Show help
+bun run cli.ts --help
+```
+
+## API Usage
+
+### Health Checks and Model Inventory
+```bash
+curl -s http://localhost:8080/v1/models | jq
+```
+
+### Chat Completions
+
+**Non-streaming:**
+```bash
+curl -s http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "default",
+    "messages": [{"role": "user", "content": "Say hello"}],
+    "max_tokens": 64
+  }' | jq
+```
+
+**Streaming (Server-Sent Events):**
+```bash
+curl -N http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "default",
+    "messages": [{"role": "user", "content": "Tell a short joke"}],
+    "stream": true,
+    "max_tokens": 64
+  }'
+```
+
+**Model Specification:**
+- Use `"model": "default"` for configured model
+- Or specify exact model ID: `"model": "gemma-3-1b-it"`
+- Requests with unknown models will be rejected
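+
+The same endpoints can be driven from TypeScript with the OpenAI SDK mentioned in the prerequisites. The sketch below is illustrative rather than part of the repository: it assumes the base URL and `"model": "default"` behavior described above, and that the local server does not enforce an API key (adjust `baseURL` and `apiKey` for your deployment). Run it with Bun, for example `bun run chat-example.ts` (hypothetical file name).
+
+```ts
+import OpenAI from "openai";
+
+// Assumptions: local predict-otron-9000 on port 8080, no API key enforced.
+const client = new OpenAI({
+  baseURL: "http://localhost:8080/v1",
+  apiKey: "sk-local-placeholder",
+});
+
+// Non-streaming chat completion
+const completion = await client.chat.completions.create({
+  model: "default",
+  messages: [{ role: "user", content: "Say hello" }],
+  max_tokens: 64,
+});
+console.log(completion.choices[0]?.message?.content);
+
+// Streaming chat completion (delivered as Server-Sent Events)
+const stream = await client.chat.completions.create({
+  model: "default",
+  messages: [{ role: "user", content: "Tell a short joke" }],
+  stream: true,
+  max_tokens: 64,
+});
+for await (const chunk of stream) {
+  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
+}
+```
+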
+### Embeddings API

Generate text embeddings compatible with OpenAI's embeddings API.
@@ -141,142 +231,259 @@ Generate text embeddings compatible with OpenAI's embeddings API.
}
```
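+
+For programmatic embedding generation, the OpenAI SDK's embeddings call can be pointed at the same server. This is a sketch under the same assumptions as the chat example above (local base URL, no enforced API key); the model name shown is a placeholder, so substitute whatever identifier the embeddings service actually accepts.
+
+```ts
+import OpenAI from "openai";
+
+// Illustrative only: base URL and model name are assumptions, not verified values.
+const client = new OpenAI({
+  baseURL: "http://localhost:8080/v1",
+  apiKey: "sk-local-placeholder",
+});
+
+const response = await client.embeddings.create({
+  model: "nomic-embed-text-v1.5", // placeholder; use the model your embeddings service expects
+  input: "The quick brown fox jumps over the lazy dog",
+});
+
+console.log(`embedding length: ${response.data[0].embedding.length}`);
+```
+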
-### Chat Completions
+### Web Frontend
+- Navigate to `http://localhost:8788`
+- Real-time chat interface with the inference server
+- Supports streaming responses and conversation history

-Generate chat completions (simplified implementation).
+## Testing

-**Endpoint**: `POST /v1/chat/completions`
+### Test Categories

-**Request Body**:
-```json
-{
-  "model": "gemma-2b-it",
-  "messages": [
-    {
-      "role": "user",
-      "content": "Hello, how are you?"
-    }
-  ]
-}
+1. **Offline/fast tests**: No network or model downloads required
+2. **Online tests**: Require HF authentication and model downloads
+3. **Integration tests**: Multi-service end-to-end testing
+
+### Quick Start: Offline Tests
+
+**Prompt formatting tests:**
+```bash
+cargo test --workspace build_gemma_prompt
 ```

-**Response**:
-```json
-{
-  "id": "chatcmpl-...",
-  "object": "chat.completion",
-  "created": 1699123456,
-  "model": "gemma-2b-it",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "role": "assistant",
-        "content": "Hello! This is the unified predict-otron-9000 server..."
-      },
-      "finish_reason": "stop"
-    }
-  ],
-  "usage": {
-    "prompt_tokens": 10,
-    "completion_tokens": 35,
-    "total_tokens": 45
-  }
-}
+**Model metadata tests:**
+```bash
+cargo test --workspace which_
 ```

-### Health Check
+These verify core functionality without requiring HF access.

-**Endpoint**: `GET /`
+### Full Test Suite (Requires HF)

-Returns a simple "Hello, World!" message to verify the server is running.
+**Prerequisites:**
+1. Accept Gemma model licenses on Hugging Face
+2. Authenticate: `huggingface-cli login` or `export HF_TOKEN=...`
+3. Optional: `export HF_HOME="$PWD/.hf-cache"`
+
+**Run all tests:**
+```bash
+cargo test --workspace
+```
+
+### Integration Testing
+
+**End-to-end test script:**
+```bash
+./test.sh
+```
+
+This script:
+- Starts the server in background with proper cleanup
+- Waits for server readiness via health checks
+- Runs CLI tests for model listing and chat completion
+- Includes 60-second timeout and process management
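+
+For a quick scripted check along the same lines, a minimal TypeScript smoke test can exercise the two main endpoints. This is a sketch only, not a substitute for `./test.sh`; the file name `smoke-test.ts` is hypothetical and the server is assumed to already be running on the default port.
+
+```ts
+// smoke-test.ts (hypothetical): run with `bun run smoke-test.ts` once the server is up.
+const BASE = "http://localhost:8080";
+
+// 1. The model inventory should answer with HTTP 200.
+const models = await fetch(`${BASE}/v1/models`);
+if (!models.ok) throw new Error(`/v1/models returned ${models.status}`);
+
+// 2. A small chat completion should succeed and contain a message.
+const chat = await fetch(`${BASE}/v1/chat/completions`, {
+  method: "POST",
+  headers: { "Content-Type": "application/json" },
+  body: JSON.stringify({
+    model: "default",
+    messages: [{ role: "user", content: "ping" }],
+    max_tokens: 16,
+  }),
+});
+if (!chat.ok) throw new Error(`/v1/chat/completions returned ${chat.status}`);
+const body = await chat.json();
+console.log("smoke test OK:", body.choices?.[0]?.message?.content);
+```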

## Development

-### Project Structure
+### Code Style and Tooling

-```
-predict-otron-9000/
-├── Cargo.toml                 # Workspace configuration
-├── README.md                  # This file
-├── run_server.sh              # Server startup script
-└── crates/
-    ├── predict-otron-9000/    # Main unified server
-    │   ├── Cargo.toml
-    │   └── src/
-    │       └── main.rs
-    ├── embeddings-engine/     # Text embeddings functionality
-    │   ├── Cargo.toml
-    │   └── src/
-    │       ├── lib.rs
-    │       └── main.rs
-    └── inference-engine/      # Text generation functionality
-        ├── Cargo.toml
-        ├── src/
-        │   ├── lib.rs
-        │   ├── cli.rs
-        │   ├── server.rs
-        │   ├── model.rs
-        │   ├── text_generation.rs
-        │   ├── token_output_stream.rs
-        │   ├── utilities_lib.rs
-        │   └── openai_types.rs
-        └── tests/
+**Formatting:**
+```bash
+cargo fmt --all
 ```

-### Running Tests
-
-```shell
-# Run all tests
-cargo test
-
-# Run tests for a specific crate
-cargo test -p embeddings-engine
-cargo test -p inference-engine
+**Linting:**
+```bash
+cargo clippy --workspace --all-targets -- -D warnings
 ```

-For comprehensive testing documentation, including unit tests, integration tests, end-to-end tests, and performance testing, please refer to the [TESTING.md](docs/TESTING.md) document.
+**Logging:**
+- Server uses `tracing` framework
+- Control via `RUST_LOG` (e.g., `RUST_LOG=debug ./scripts/run_server.sh`)

-For performance benchmarking with HTML report generation, see the [BENCHMARKING.md](BENCHMARKING.md) guide.
+### Adding Tests

-### Adding Features
+**For fast, offline tests:**
+- Exercise pure logic without tokenizers/models
+- Use descriptive names for easy filtering: `cargo test specific_test_name`
+- Example patterns: prompt construction, metadata selection, tensor math

-1. **Embeddings Engine**: Modify `crates/embeddings-engine/src/lib.rs` to add new embedding models or functionality
-2. **Inference Engine**: The inference engine has a modular structure - add new models in the `model.rs` module
-3. **Unified Server**: Update `crates/predict-otron-9000/src/main.rs` to integrate new capabilities
+**Process:**
+1. Add test to existing module
+2. Run filtered: `cargo test --workspace new_test_name`
+3. Verify in full suite: `cargo test --workspace`

-## Logging and Debugging
+### OpenAI API Compatibility

-The application uses structured logging with tracing. Log levels can be controlled via the `RUST_LOG` environment variable:
+**Features:**
+- POST `/v1/chat/completions` with streaming and non-streaming
+- Single configured model enforcement (use `"model": "default"`)
+- Gemma-style prompt formatting with `<start_of_turn>`/`<end_of_turn>` markers
+- System prompt injection into first user turn
+- Repetition detection and early stopping in streaming mode

-```shell
-# Debug level logging
-export RUST_LOG=debug
+**CORS:**
+- Fully open by default (`tower-http CorsLayer::Any`)
+- Adjust for production deployment

-# Trace level for detailed embeddings debugging
-export RUST_LOG=trace
+### Architecture Details

-# Module-specific logging
-export RUST_LOG=predict_otron_9000=debug,embeddings_engine=trace
+**Device Selection:**
+- Automatic device/dtype selection
+- CPU: Universal fallback (F32 precision)
+- CUDA: BF16 precision on compatible GPUs
+- Metal: Available but routed to CPU for Gemma v3 stability
+
+**Model Loading:**
+- Single-file `model.safetensors` preferred
+- Falls back to index resolution via `utilities_lib::hub_load_safetensors`
+- HF cache populated on first access
+
+**Multi-Service Design:**
+- Main server orchestrates inference and embeddings
+- Services can run independently for horizontal scaling
+- Docker/Kubernetes metadata included for deployment
+
+## Deployment
+
+### Docker Support
+
+All services include Docker metadata in `Cargo.toml`:
+
+**Main Server:**
+- Image: `ghcr.io/geoffsee/predict-otron-9000:latest`
+- Port: 8080
+
+**Inference Service:**
+- Image: `ghcr.io/geoffsee/inference-service:latest`
+- Port: 8080
+
+**Embeddings Service:**
+- Image: `ghcr.io/geoffsee/embeddings-service:latest`
+- Port: 8080
+
+**Web Frontend:**
+- Image: `ghcr.io/geoffsee/leptos-app:latest`
+- Port: 8788
+
+**Docker Compose:**
+```bash
+# Start all services
+docker-compose up -d
+
+# Check logs
+docker-compose logs -f
+
+# Stop services
+docker-compose down
 ```

-### Usage
+### Kubernetes Support

-The chat interface connects to the inference engine API and provides a user-friendly way to interact with the AI models. To use:
+All services include Kubernetes manifest metadata:
+- Single replica deployments by default
+- Service-specific port configurations
+- Ready for horizontal pod autoscaling

-1. Start the predict-otron-9000 server
-2. Open the chat interface in a web browser
-3. Enter messages and receive AI-generated responses
+For Kubernetes deployment details, see the [ARCHITECTURE.md](docs/ARCHITECTURE.md) document.

-The interface supports:
-- Real-time messaging with the AI
-- Visual indication of when the AI is generating a response
-- Message history display
+### Build Artifacts

-## Limitations
+**Ignored by Git:**
+- `target/` (Rust build artifacts)
+- `node_modules/` (Node.js dependencies)
+- `dist/` (Frontend build output)
+- `.fastembed_cache/` (FastEmbed model cache)
+- `.hf-cache/` (Hugging Face cache, if configured)

-- **Inference Engine**: Currently provides a simplified implementation for chat completions. Full model loading and text generation capabilities from the inference-engine crate are not yet integrated into the unified server.
-- **Model Support**: Embeddings are limited to the Nomic Embed Text v1.5 model.
+## Common Issues and Solutions
+
+### Authentication/Licensing
+**Symptom:** 404 or permission errors fetching models
+**Solution:**
+1. Accept Gemma model licenses on Hugging Face
+2. Authenticate with `huggingface-cli login` or `HF_TOKEN`
+3. Verify token with `huggingface-cli whoami`
+
+### GPU Issues
+**Symptom:** OOM errors or GPU panics
+**Solution:**
+1. Test on CPU first: set `CUDA_VISIBLE_DEVICES=""` to force CPU execution
+2. Check available VRAM vs model requirements
+3. Consider using smaller model variants
+
+### Model Mismatch Errors
+**Symptom:** 400 errors with `type=model_mismatch`
+**Solution:**
+- Use `"model": "default"` in API requests
+- Or match configured model ID exactly: `"model": "gemma-3-1b-it"`
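+
+A client can also avoid hard-coding the model name by reading it from `/v1/models` first. This is an illustrative sketch; it assumes the server returns the standard OpenAI list-models shape (`data[].id`), as suggested by the compatibility notes above.
+
+```ts
+const BASE = "http://localhost:8080";
+
+// Discover the configured model instead of guessing its ID.
+const list = await (await fetch(`${BASE}/v1/models`)).json();
+const modelId: string = list.data?.[0]?.id ?? "default";
+
+const resp = await fetch(`${BASE}/v1/chat/completions`, {
+  method: "POST",
+  headers: { "Content-Type": "application/json" },
+  body: JSON.stringify({
+    model: modelId,
+    messages: [{ role: "user", content: "Hello" }],
+    max_tokens: 32,
+  }),
+});
+console.log((await resp.json()).choices?.[0]?.message?.content);
+```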
+
+### Frontend Build Issues
+**Symptom:** WASM compilation failures
+**Solution:**
+1. Install required targets: `rustup target add wasm32-unknown-unknown`
+2. Install trunk: `cargo install trunk`
+3. Check RUSTFLAGS in leptos-app/run.sh
+
+### Network/Timeout Issues
+**Symptom:** First-time model downloads timing out
+**Solution:**
+1. Ensure stable internet connection
+2. Consider using local HF cache: `export HF_HOME="$PWD/.hf-cache"`
+3. Download models manually with `huggingface-cli`
+
+## Minimal End-to-End Verification
+
+**Build verification:**
+```bash
+cargo build --workspace --release
+```
+
+**Fast offline tests:**
+```bash
+cargo test --workspace build_gemma_prompt
+cargo test --workspace which_
+```
+
+**Service startup:**
+```bash
+./scripts/run_server.sh &
+sleep 10  # Wait for server startup
+curl -s http://localhost:8080/v1/models | jq
+```
+
+**CLI client test:**
+```bash
+bun run cli.ts "What is 2+2?"
+```
+
+**Web frontend:**
+```bash
+cd crates/leptos-app && ./run.sh &
+# Navigate to http://localhost:8788
+```
+
+**Integration test:**
+```bash
+./test.sh
+```
+
+**Cleanup:**
+```bash
+pkill -f "predict-otron-9000"
+pkill -f "trunk"
+```
+
+For networked tests and full functionality, ensure Hugging Face authentication is configured as described above.
+
+## Further Reading
+
+### Documentation
+
+- [Architecture](docs/ARCHITECTURE.md) - Detailed architectural diagrams and deployment patterns
+- [Server Configuration Guide](docs/SERVER_CONFIG.md) - Detailed server configuration options
+- [Testing Documentation](docs/TESTING.md) - Comprehensive testing guide
+- [Performance Benchmarking](docs/BENCHMARKING.md) - Instructions for benchmarking

## Contributing
@@ -286,45 +493,4 @@ The interface supports:
4. Ensure all tests pass: `cargo test`
5. Submit a pull request

-
-## Quick cURL verification for Chat Endpoints
-
-Start the unified server:
-
-```
-./run_server.sh
-```
-
-Non-streaming chat completion (expects JSON response):
-
-```
-curl -X POST http://localhost:8080/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "gemma-3-1b-it",
-    "messages": [
-      {"role": "user", "content": "Who was the 16th president of the United States?"}
-    ],
-    "max_tokens": 128,
-    "stream": false
-  }'
-```
-
-Streaming chat completion via Server-Sent Events (SSE):
-
-```
-curl -N -X POST http://localhost:8080/v1/chat/completions/stream \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "gemma-3-1b-it",
-    "messages": [
-      {"role": "user", "content": "Who was the 16th president of the United States?"}
-    ],
-    "max_tokens": 128,
-    "stream": true
-  }'
-```
-
-Helper scripts are also available:
-- scripts/curl_chat.sh
-- scripts/curl_chat_stream.sh
+_Warning: Do NOT use this in production unless you are cool like that._
\ No newline at end of file