predict-otron-9000

A comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces.

Powerful local AI inference with OpenAI-compatible APIs

Project Overview

The predict-otron-9000 is a flexible AI platform that provides:

  • Local LLM Inference: Run Gemma and Llama models locally with CPU or GPU acceleration
  • Embeddings Generation: Create text embeddings with FastEmbed
  • Web Interface: Interact with models through a Leptos WASM chat interface
  • TypeScript CLI: Command-line client for testing and automation
  • Production Deployment: Docker and Kubernetes deployment options

The system supports both CPU and GPU acceleration (CUDA/Metal), with intelligent fallbacks and platform-specific optimizations.

Features

  • OpenAI Compatible: API endpoints match OpenAI's format for easy integration
  • Text Embeddings: Generate high-quality text embeddings using FastEmbed
  • Text Generation: Chat completions with OpenAI-compatible API using Gemma and Llama models (various sizes including instruction-tuned variants)
  • Performance Optimized: Efficient caching and platform-specific optimizations for improved throughput
  • Web Chat Interface: Leptos-based WebAssembly (WASM) chat interface for browser-based interaction
  • Flexible Deployment: Run as monolithic service or microservices architecture

Architecture Overview

Workspace Structure

The project uses a 7-crate Rust workspace plus TypeScript components:

crates/
├── predict-otron-9000/     # Main orchestration server (Rust 2024)
├── inference-engine/       # Multi-model inference orchestrator (Rust 2021)
├── gemma-runner/           # Gemma model inference via Candle (Rust 2021)
├── llama-runner/           # Llama model inference via Candle (Rust 2021)
├── embeddings-engine/      # FastEmbed embeddings service (Rust 2024)
├── leptos-app/             # WASM web frontend (Rust 2021)
└── helm-chart-tool/        # Kubernetes deployment tooling (Rust 2024)
scripts/
└── cli.ts                  # TypeScript/Bun CLI client

Service Architecture

  • Main Server (port 8080): Orchestrates inference and embeddings services
  • Embeddings Service (port 8080 when run standalone): FastEmbed service with OpenAI API compatibility
  • Web Frontend (port 8788): Leptos WASM chat interface served by Trunk
  • CLI Client: TypeScript/Bun client for testing and automation

Deployment Modes

The architecture supports multiple deployment patterns:

  1. Development Mode: All services run in a single process for simplified development
  2. Docker Monolithic: Single containerized service handling all functionality
  3. Kubernetes Microservices: Separate services for horizontal scalability and fault isolation

Build and Configuration

Dependencies and Environment Prerequisites

Rust Toolchain

  • Editions: Mixed: predict-otron-9000, embeddings-engine, and helm-chart-tool use Rust 2024; the remaining crates use Rust 2021
  • Recommended: Latest stable Rust toolchain: rustup default stable && rustup update
  • Developer tools:
    • rustup component add rustfmt (formatting)
    • rustup component add clippy (linting)

Node.js/Bun Toolchain

  • Bun: Required for TypeScript CLI client: curl -fsSL https://bun.sh/install | bash
  • Node.js: Works as an alternative to Bun; the CLI targets OpenAI SDK v5.16.0+

WASM Frontend Toolchain

  • Trunk: Required for Leptos frontend builds: cargo install trunk
  • wasm-pack: cargo install wasm-pack
  • WASM target: rustup target add wasm32-unknown-unknown

ML Framework Dependencies

  • Candle: Version 0.9.1 with conditional compilation:
    • macOS: Metal support with CPU fallback for stability
    • Linux: CUDA support with CPU fallback
    • CPU-only: Supported on all platforms
  • FastEmbed: Version 4.x for embeddings functionality

Hugging Face Access

  • Required for: Gemma model downloads (gated models)
  • Authentication:
    • CLI: pip install -U "huggingface_hub[cli]" && huggingface-cli login
    • Environment: export HF_TOKEN="<your_token>"
  • Cache management: export HF_HOME="$PWD/.hf-cache" (optional, keeps cache local)
  • Model access: Accept Gemma model licenses on Hugging Face before use

Platform-Specific Notes

  • macOS: Metal acceleration available but routed to CPU for Gemma v3 stability
  • Linux: CUDA support with BF16 precision on GPU, F32 on CPU
  • Conditional compilation: Handled automatically per platform in Cargo.toml

Build Procedures

Full Workspace Build

cargo build --workspace --release

Individual Services

Main Server:

cargo build --bin predict-otron-9000 --release

Inference Engine CLI:

cargo build --bin cli --package inference-engine --release

Embeddings Service:

cargo build --bin embeddings-engine --release

Web Frontend:

cd crates/leptos-app
trunk build --release

Running Services

Main Server (Port 8080)

./scripts/run_server.sh
  • Respects SERVER_PORT (default: 8080) and RUST_LOG (default: info)
  • Boots with default model: gemma-3-1b-it
  • Requires HF authentication for first-time model download

Web Frontend (Port 8788)

cd crates/leptos-app
./run.sh
  • Serves Leptos WASM frontend on port 8788
  • Sets required RUSTFLAGS for WebAssembly getrandom support
  • Auto-reloads during development

TypeScript CLI Client

# List available models
bun run scripts/cli.ts --list-models

# Chat completion
bun run scripts/cli.ts "What is the capital of France?"

# With specific model
bun run scripts/cli.ts --model gemma-3-1b-it --prompt "Hello, world!"

# Show help
bun run scripts/cli.ts --help

API Usage

Health Checks and Model Inventory

curl -s http://localhost:8080/v1/models | jq
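
The same inventory can be queried programmatically. Below is a minimal TypeScript sketch using the official openai package (installed separately, e.g. bun add openai); the apiKey value is a placeholder, since nothing here states that the server validates keys:

import OpenAI from "openai";

// Point the SDK at the local server instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local", // placeholder; key validation is not documented
});

// Print the IDs of the models the server reports as available.
const models = await client.models.list();
for (const model of models.data) {
  console.log(model.id);
}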

Chat Completions

Non-streaming:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 64
  }' | jq
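
The same non-streaming request through the OpenAI TypeScript SDK, as a sketch (client setup as in the model-listing example above):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local", // placeholder; key validation is not documented
});

// Non-streaming chat completion against the local server.
const completion = await client.chat.completions.create({
  model: "default",
  messages: [{ role: "user", content: "Say hello" }],
  max_tokens: 64,
});

console.log(completion.choices[0].message.content);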

Streaming (Server-Sent Events):

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default", 
    "messages": [{"role": "user", "content": "Tell a short joke"}],
    "stream": true,
    "max_tokens": 64
  }'
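
Consuming the same SSE stream from TypeScript with the OpenAI SDK, as a sketch (chunks follow the OpenAI delta format, so content may be absent on some chunks):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local", // placeholder; key validation is not documented
});

// Stream tokens as they arrive and print them incrementally.
const stream = await client.chat.completions.create({
  model: "default",
  messages: [{ role: "user", content: "Tell a short joke" }],
  stream: true,
  max_tokens: 64,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
process.stdout.write("\n");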

Model Specification:

  • Use "model": "default" for configured model
  • Or specify exact model ID: "model": "gemma-3-1b-it"
  • Requests with unknown models will be rejected

Embeddings API

Generate text embeddings compatible with OpenAI's embeddings API.

Endpoint: POST /v1/embeddings

Request Body:

{
  "input": "Your text to embed",
  "model": "nomic-embed-text-v1.5"
}

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.1, 0.2, 0.3]
    }
  ],
  "model": "nomic-embed-text-v1.5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
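
Calling the endpoint from TypeScript via the OpenAI SDK, as a sketch (the embedding length depends on the embedding model the service is configured with):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "sk-local", // placeholder; key validation is not documented
});

// Request an embedding for a single input string.
const response = await client.embeddings.create({
  model: "nomic-embed-text-v1.5",
  input: "Your text to embed",
});

const vector = response.data[0].embedding;
console.log(`dimensions: ${vector.length}`);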

Web Frontend

  • Navigate to http://localhost:8788
  • Real-time chat interface with the inference server
  • Supports streaming responses and conversation history

Testing

Test Categories

  1. Offline/fast tests: No network or model downloads required
  2. Online tests: Require HF authentication and model downloads
  3. Integration tests: Multi-service end-to-end testing

Quick Start: Offline Tests

Prompt formatting tests:

cargo test --workspace build_gemma_prompt

Model metadata tests:

cargo test --workspace which_

These verify core functionality without requiring HF access.

Full Test Suite (Requires HF)

Prerequisites:

  1. Accept Gemma model licenses on Hugging Face
  2. Authenticate: huggingface-cli login or export HF_TOKEN=...
  3. Optional: export HF_HOME="$PWD/.hf-cache"

Run all tests:

cargo test --workspace

Integration Testing

End-to-end test script:

./test.sh

This script:

  • Starts the server in background with proper cleanup
  • Waits for server readiness via health checks
  • Runs CLI tests for model listing and chat completion
  • Includes 60-second timeout and process management
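
For reference, a similar readiness check can be expressed as a Bun test. The sketch below is hypothetical (not a file in the repository) and assumes the server is already running on port 8080 and that /v1/models returns an OpenAI-style list object:

import { expect, test } from "bun:test";

const BASE_URL = process.env.BASE_URL ?? "http://localhost:8080";

test("server reports at least one model", async () => {
  const res = await fetch(`${BASE_URL}/v1/models`);
  expect(res.status).toBe(200);

  // Expect an OpenAI-style list: { "object": "list", "data": [...] }
  const body = await res.json();
  expect(body.object).toBe("list");
  expect(body.data.length).toBeGreaterThan(0);
});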

Development

Code Style and Tooling

Formatting:

cargo fmt --all

Linting:

cargo clippy --workspace --all-targets -- -D warnings

Logging:

  • Server uses tracing framework
  • Control via RUST_LOG (e.g., RUST_LOG=debug ./scripts/run_server.sh)

Adding Tests

For fast, offline tests:

  • Exercise pure logic without tokenizers/models
  • Use descriptive names for easy filtering: cargo test specific_test_name
  • Example patterns: prompt construction, metadata selection, tensor math

Process:

  1. Add test to existing module
  2. Run filtered: cargo test --workspace new_test_name
  3. Verify in full suite: cargo test --workspace

OpenAI API Compatibility

Features:

  • POST /v1/chat/completions with streaming and non-streaming
  • Single configured model enforcement (use "model": "default")
  • Gemma-style prompt formatting with <start_of_turn>/<end_of_turn> markers (illustrated after this list)
  • System prompt injection into first user turn
  • Repetition detection and early stopping in streaming mode
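
For orientation, the sketch below illustrates the general shape of a Gemma-style prompt in TypeScript. It is illustrative only; the server's actual prompt builder is implemented in Rust (see the build_gemma_prompt tests) and may differ in details:

// Illustrative only: approximate Gemma turn markup for a single exchange.
// The system prompt, if any, is folded into the first user turn.
function buildGemmaPrompt(system: string | null, userMessage: string): string {
  const firstTurn = system ? `${system}\n\n${userMessage}` : userMessage;
  return (
    `<start_of_turn>user\n${firstTurn}<end_of_turn>\n` +
    `<start_of_turn>model\n`
  );
}

console.log(buildGemmaPrompt("Be concise.", "Say hello"));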

CORS:

  • Fully open by default (permissive tower-http CorsLayer allowing any origin)
  • Adjust for production deployment

Architecture Details

Device Selection:

  • Automatic device/dtype selection
  • CPU: Universal fallback (F32 precision)
  • CUDA: BF16 precision on compatible GPUs
  • Metal: Available but routed to CPU for Gemma v3 stability

Model Loading:

  • Single-file model.safetensors preferred
  • Falls back to index resolution via utilities_lib::hub_load_safetensors
  • HF cache populated on first access

Multi-Service Design:

  • Main server orchestrates inference and embeddings
  • Services can run independently for horizontal scaling
  • Docker/Kubernetes metadata included for deployment

Deployment

Docker Support

All services include Docker metadata in Cargo.toml:

Main Server:

  • Image: ghcr.io/geoffsee/predict-otron-9000:latest
  • Port: 8080

Inference Service:

  • Image: ghcr.io/geoffsee/inference-service:latest
  • Port: 8080

Embeddings Service:

  • Image: ghcr.io/geoffsee/embeddings-service:latest
  • Port: 8080

Web Frontend:

  • Image: ghcr.io/geoffsee/leptos-app:latest
  • Port: 8788

Docker Compose:

# Start all services
docker-compose up -d

# Check logs
docker-compose logs -f

# Stop services
docker-compose down

Kubernetes Support

All services include Kubernetes manifest metadata:

  • Single replica deployments by default
  • Service-specific port configurations
  • Ready for horizontal pod autoscaling

For Kubernetes deployment details, see the ARCHITECTURE.md document.

Build Artifacts

Ignored by Git:

  • target/ (Rust build artifacts)
  • node_modules/ (Node.js dependencies)
  • dist/ (Frontend build output)
  • .fastembed_cache/ (FastEmbed model cache)
  • .hf-cache/ (Hugging Face cache, if configured)

Common Issues and Solutions

Authentication/Licensing

Symptom: 404 or permission errors fetching models
Solution:

  1. Accept Gemma model licenses on Hugging Face
  2. Authenticate with huggingface-cli login or HF_TOKEN
  3. Verify token with huggingface-cli whoami

GPU Issues

Symptom: OOM errors or GPU panics
Solution:

  1. Test on CPU first: set CUDA_VISIBLE_DEVICES="" to force CPU execution if needed
  2. Check available VRAM vs model requirements
  3. Consider using smaller model variants

Model Mismatch Errors

Symptom: 400 errors with type=model_mismatch
Solution:

  • Use "model": "default" in API requests
  • Or match configured model ID exactly: "model": "gemma-3-1b-it"

Frontend Build Issues

Symptom: WASM compilation failures
Solution:

  1. Install required targets: rustup target add wasm32-unknown-unknown
  2. Install trunk: cargo install trunk
  3. Check RUSTFLAGS in leptos-app/run.sh

Network/Timeout Issues

Symptom: First-time model downloads timing out
Solution:

  1. Ensure stable internet connection
  2. Consider using local HF cache: export HF_HOME="$PWD/.hf-cache"
  3. Download models manually with huggingface-cli

Minimal End-to-End Verification

Build verification:

cargo build --workspace --release

Fast offline tests:

cargo test --workspace build_gemma_prompt
cargo test --workspace which_

Service startup:

./scripts/run_server.sh &
sleep 10  # Wait for server startup
curl -s http://localhost:8080/v1/models | jq

CLI client test:

bun run scripts/cli.ts "What is 2+2?"

Web frontend:

cd crates/leptos-app && ./run.sh &
# Navigate to http://localhost:8788

Integration test:

./test.sh

Cleanup:

pkill -f "predict-otron-9000"
pkill -f "trunk"

For networked tests and full functionality, ensure Hugging Face authentication is configured as described above.

Further Reading

Documentation

  • ARCHITECTURE.md: service architecture and Kubernetes deployment details

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Ensure all tests pass: cargo test
  5. Submit a pull request

Warning: Do NOT use this in production unless you are cool like that.
