# Predict-Otron-9000 Architecture Documentation
This document provides comprehensive architectural diagrams for the Predict-Otron-9000 multi-service AI platform, showing all supported configurations and deployment patterns.
## Table of Contents
- [System Overview](#system-overview)
- [Workspace Structure](#workspace-structure)
- [Deployment Configurations](#deployment-configurations)
  - [Development Mode](#development-mode)
  - [Docker Monolithic](#docker-monolithic)
  - [Kubernetes Microservices](#kubernetes-microservices)
- [Service Interactions](#service-interactions)
- [Platform-Specific Configurations](#platform-specific-configurations)
- [Data Flow Patterns](#data-flow-patterns)
## System Overview
Predict-Otron-9000 is a multi-service AI platform built around local LLM inference, embeddings, and a web interface. The system supports deployment patterns ranging from a single monolithic process to fully separated microservices.
```mermaid
graph TB
    subgraph "Core Components"
        A[Main Server<br/>predict-otron-9000]
        B[Inference Engine<br/>Gemma/Llama via Candle]
        C[Embeddings Engine<br/>FastEmbed]
        D[Web Frontend<br/>Leptos WASM]
    end

    subgraph "Client Interfaces"
        E[TypeScript CLI<br/>Bun/Node.js]
        F[Web Browser<br/>HTTP/WebSocket]
        G[HTTP API Clients<br/>OpenAI Compatible]
    end

    subgraph "Platform Support"
        H[CPU Fallback<br/>All Platforms]
        I[CUDA Support<br/>Linux GPU]
        J[Metal Support<br/>macOS GPU]
    end

    A --- B
    A --- C
    A --- D
    E -.-> A
    F -.-> A
    G -.-> A
    B --- H
    B --- I
    B --- J
```
## Workspace Structure
The project uses a 7-crate Rust workspace with TypeScript tooling, designed for maximum flexibility in deployment configurations.
```mermaid
graph TD
    subgraph "Rust Workspace"
        subgraph "Main Orchestrator"
            A[predict-otron-9000<br/>Edition: 2024<br/>Port: 8080]
        end

        subgraph "AI Services"
            B[inference-engine<br/>Edition: 2021<br/>Port: 8080<br/>Multi-model orchestrator]
            J[gemma-runner<br/>Edition: 2021<br/>Gemma via Candle]
            K[llama-runner<br/>Edition: 2021<br/>Llama via Candle]
            C[embeddings-engine<br/>Edition: 2024<br/>Port: 8080<br/>FastEmbed]
        end

        subgraph "Frontend"
            D[leptos-app<br/>Edition: 2021<br/>Port: 3000/8788<br/>WASM/SSR]
        end

        subgraph "Tooling"
            L[helm-chart-tool<br/>Edition: 2024<br/>K8s deployment]
        end
    end

    subgraph "External Tooling"
        E[scripts/cli.ts<br/>TypeScript/Bun<br/>OpenAI SDK]
    end

    subgraph "Dependencies"
        F[Candle 0.9.1]
        G[FastEmbed 4.x]
        H[Leptos 0.8.0]
        I[OpenAI SDK 5.16+]
    end

    A --> B
    A --> C
    A --> D
    B --> J
    B --> K
    J -.-> F
    K -.-> F
    C -.-> G
    D -.-> H
    E -.-> I

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style J fill:#f3e5f5
    style K fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
    style L fill:#fff9c4
```
## Deployment Configurations
### Development Mode
Local development runs all services embedded in the main server process for simplicity. An optional standalone Leptos dev server (port 8788) provides frontend hot reloading; a library-composition sketch follows the diagram.
```mermaid
graph LR
    subgraph "Development Environment"
        subgraph "Single Process - Port 8080"
            A[predict-otron-9000 Server]
            A --> B[Embedded Inference Engine]
            A --> C[Embedded Embeddings Engine]
            A --> D[SSR Leptos Frontend]
        end
        E[Leptos Dev Server<br/>Port 8788]
    end

    subgraph "External Clients"
        F[CLI Client<br/>cli.ts via Bun]
        G[Web Browser]
        H[HTTP API Clients]
    end

    F -.-> A
    G -.-> A
    G -.-> E
    H -.-> A

    style A fill:#e3f2fd
    style E fill:#f1f8e9
```
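Because the inference and embeddings engines double as libraries, the development server can mount them directly into one process. Below is a minimal sketch of that composition using axum; the `router()` constructors are hypothetical stand-ins for whatever entry points the inference-engine and embeddings-engine crates actually export.

```rust
use axum::Router;

// Hypothetical library-mode entry points; the real crates may expose
// different constructors.
use embeddings_engine::router as embeddings_router;
use inference_engine::router as inference_router;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Compose the embedded services into a single Router on port 8080.
    let app = Router::new()
        .merge(inference_router())   // e.g. /v1/chat/completions
        .merge(embeddings_router()); // e.g. /v1/embeddings

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```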
### Docker Monolithic
Docker Compose runs a single containerized service handling all functionality.
```mermaid
graph TB
    subgraph "Docker Environment"
        subgraph "predict-otron-9000 Container"
            A[Main Server :8080]
            A --> B[Inference Engine<br/>Library Mode]
            A --> C[Embeddings Engine<br/>Library Mode]
            A --> D[Leptos Frontend<br/>SSR Mode]
        end

        subgraph "Persistent Storage"
            E[HF Cache Volume<br/>/.hf-cache]
            F[FastEmbed Cache Volume<br/>/.fastembed_cache]
        end

        subgraph "Network"
            G[predict-otron-network<br/>Bridge Driver]
        end
    end

    subgraph "External Access"
        H[Host Port 8080]
        I[External Clients]
    end

    A --- E
    A --- F
    A --- G
    H --> A
    I -.-> H

    style A fill:#e8f5e8
    style E fill:#fff3e0
    style F fill:#fff3e0
```
### Kubernetes Microservices
Kubernetes deployment separates all services for horizontal scalability and fault isolation. In this mode the main server proxies OpenAI-compatible requests to the downstream services over HTTP (see the sketch after the diagram).
```mermaid
graph TB
    subgraph "Kubernetes Namespace"
        subgraph "Main Orchestrator"
            A[predict-otron-9000 Pod<br/>:8080<br/>ClusterIP Service]
        end

        subgraph "AI Services"
            B[inference-engine Pod<br/>:8080<br/>ClusterIP Service]
            C[embeddings-engine Pod<br/>:8080<br/>ClusterIP Service]
        end

        subgraph "Frontend"
            D[leptos-app Pod<br/>:8788<br/>ClusterIP Service]
        end

        subgraph "Ingress"
            E[Ingress Controller<br/>predict-otron-9000.local]
        end
    end

    subgraph "External"
        F[External Clients]
        G[Container Registry<br/>ghcr.io/geoffsee/*]
    end

    A <--> B
    A <--> C
    E --> A
    E --> D
    F -.-> E
    G -.-> A
    G -.-> B
    G -.-> C
    G -.-> D

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
```
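The proxy hop from the main server to a downstream service is plain HTTP against the in-cluster ClusterIP address. A minimal sketch, assuming a hypothetical `INFERENCE_SERVICE_URL` environment variable carries that address (e.g. `http://inference-engine:8080`):

```rust
use axum::{body::Bytes, http::StatusCode};

// Hypothetical handler: forwards a chat-completion body to the inference
// service named by INFERENCE_SERVICE_URL (an assumed variable set by the
// deployment in this sketch, not a documented configuration key).
async fn proxy_chat_completion(body: Bytes) -> Result<Bytes, StatusCode> {
    let base = std::env::var("INFERENCE_SERVICE_URL")
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;

    let resp = reqwest::Client::new()
        .post(format!("{base}/v1/chat/completions"))
        .header("content-type", "application/json")
        .body(body)
        .send()
        .await
        .map_err(|_| StatusCode::BAD_GATEWAY)?;

    // Buffers the whole response for brevity; a production proxy would
    // stream the body through instead.
    resp.bytes().await.map_err(|_| StatusCode::BAD_GATEWAY)
}
```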
## Service Interactions
### API Flow and Communication Patterns
```mermaid
sequenceDiagram
    participant Client as External Client
    participant Main as Main Server (Port 8080)
    participant Inf as Inference Engine
    participant Emb as Embeddings Engine
    participant Web as Web Frontend

    Note over Client, Web: Development/Monolithic Mode
    Client->>Main: POST /v1/chat/completions
    Main->>Inf: Internal call (library)
    Inf-->>Main: Generated response
    Main-->>Client: Streaming/Non-streaming response

    Client->>Main: POST /v1/embeddings
    Main->>Emb: Internal call (library)
    Emb-->>Main: Vector embeddings
    Main-->>Client: Embeddings response

    Note over Client, Web: Kubernetes Microservices Mode
    Client->>Main: POST /v1/chat/completions
    Main->>Inf: HTTP POST :8080/v1/chat/completions
    Inf-->>Main: HTTP Response (streaming)
    Main-->>Client: Proxied response

    Client->>Main: POST /v1/embeddings
    Main->>Emb: HTTP POST :8080/v1/embeddings
    Emb-->>Main: HTTP Response
    Main-->>Client: Proxied response

    Note over Client, Web: Web Interface Flow
    Web->>Main: WebSocket connection
    Web->>Main: Chat message
    Main->>Inf: Process inference
    Inf-->>Main: Streaming tokens
    Main-->>Web: WebSocket stream
```
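Whichever mode is deployed, clients target the same OpenAI-compatible endpoints. A minimal sketch of the two request types above using reqwest and serde_json; the base URL and model names are placeholders:

```rust
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let base = "http://localhost:8080"; // placeholder address

    // Non-streaming chat completion.
    let chat: Value = client
        .post(format!("{base}/v1/chat/completions"))
        .json(&json!({
            "model": "gemma-placeholder",
            "messages": [{ "role": "user", "content": "Hello!" }],
            "stream": false
        }))
        .send()
        .await?
        .json()
        .await?;
    println!("{}", chat["choices"][0]["message"]["content"]);

    // Embeddings request.
    let emb: Value = client
        .post(format!("{base}/v1/embeddings"))
        .json(&json!({
            "model": "embeddings-placeholder",
            "input": "The quick brown fox"
        }))
        .send()
        .await?
        .json()
        .await?;
    let dims = emb["data"][0]["embedding"].as_array().map_or(0, |v| v.len());
    println!("embedding dimensions: {dims}");
    Ok(())
}
```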
### Port Configuration Matrix
```mermaid
graph TB
    subgraph "Port Allocation by Mode"
        subgraph "Development"
            A[Main Server: 8080<br/>All services embedded]
            B[Leptos Dev: 8788<br/>Hot reload: 3001]
        end

        subgraph "Docker Monolithic"
            C[Main Server: 8080<br/>All services embedded<br/>Host mapped]
        end

        subgraph "Kubernetes Microservices"
            D[Main Server: 8080]
            E[Inference Engine: 8080]
            F[Embeddings Engine: 8080]
            G[Leptos Frontend: 8788]
        end
    end

    style A fill:#e3f2fd
    style C fill:#e8f5e8
    style D fill:#f3e5f5
    style E fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0
```
## Platform-Specific Configurations
### Hardware Acceleration Support
```mermaid
graph TB
    subgraph "Platform Detection"
        A[Build System]
    end

    subgraph "macOS"
        A --> B[Metal Features Available]
        B --> C[CPU Fallback<br/>Stability Priority]
        C --> D[F32 Precision<br/>Gemma Compatibility]
    end

    subgraph "Linux"
        A --> E[CUDA Features Available]
        E --> F[GPU Acceleration<br/>Performance Priority]
        F --> G[BF16 Precision<br/>GPU Optimized]
        E --> H[CPU Fallback<br/>F32 Precision]
    end

    subgraph "Other Platforms"
        A --> I[CPU Only<br/>Universal Compatibility]
        I --> J[F32 Precision<br/>Standard Support]
    end

    style B fill:#e8f5e8
    style E fill:#e3f2fd
    style I fill:#fff3e0
```
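In Rust this split is typically expressed as Cargo features plus runtime device probing. Below is a minimal sketch against candle's device API, mirroring the precision choices in the diagram (BF16 on CUDA, F32 elsewhere); the `cuda`/`metal` feature names are assumptions, not necessarily the project's actual flags:

```rust
use candle_core::{DType, Device};

// Probe devices in the documented priority order: CUDA on Linux GPUs,
// Metal on macOS, CPU as the universal fallback.
fn select_device() -> Device {
    #[cfg(feature = "cuda")]
    if let Ok(device) = Device::new_cuda(0) {
        return device;
    }
    #[cfg(feature = "metal")]
    if let Ok(device) = Device::new_metal(0) {
        return device;
    }
    Device::Cpu
}

// BF16 only on CUDA; the Metal and CPU paths stay on F32 for Gemma
// compatibility, per the diagram above.
fn select_dtype(device: &Device) -> DType {
    if device.is_cuda() { DType::BF16 } else { DType::F32 }
}
```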
### Model Loading and Caching
```mermaid
graph LR
    subgraph "Model Access Flow"
        A[Application Start] --> B{Model Cache Exists?}
        B -->|Yes| C[Load from Cache]
        B -->|No| D[HuggingFace Authentication]
        D --> E{HF Token Valid?}
        E -->|Yes| F[Download Model]
        E -->|No| G[Authentication Error]
        F --> H[Save to Cache]
        H --> C
        C --> I[Initialize Inference]
    end

    subgraph "Cache Locations"
        J[HF_HOME Cache<br/>.hf-cache]
        K[FastEmbed Cache<br/>.fastembed_cache]
    end

    F -.-> J
    F -.-> K

    style D fill:#fce4ec
    style G fill:#ffebee
```
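The HuggingFace half of this flow matches what the hf-hub crate does out of the box: it consults the local cache (honoring HF_HOME) before downloading, and picks up the token stored by `huggingface-cli login` for gated models. A minimal sketch; the repo and file names are placeholders:

```rust
use hf_hub::api::sync::Api;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Respects HF_HOME (e.g. the .hf-cache volume) and any stored token.
    let api = Api::new()?;
    let repo = api.model("google/gemma-2b-it".to_string()); // placeholder repo

    // Each get() returns the cached path if present, otherwise downloads
    // the file and caches it first.
    let tokenizer = repo.get("tokenizer.json")?;
    let weights = repo.get("model.safetensors")?;

    println!("tokenizer: {}", tokenizer.display());
    println!("weights:   {}", weights.display());
    Ok(())
}
```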
## Data Flow Patterns
### Request Processing Pipeline
```mermaid
flowchart TD
    A[Client Request] --> B{Request Type}
    B -->|Chat Completion| C[Parse Messages]
    B -->|Model List| D[Return Available Models]
    B -->|Embeddings| E[Process Text Input]

    C --> F[Apply Prompt Template]
    F --> G{Streaming?}
    G -->|Yes| H[Initialize Stream]
    G -->|No| I[Generate Complete Response]

    H --> J[Token Generation Loop]
    J --> K[Send Chunk]
    K --> L{More Tokens?}
    L -->|Yes| J
    L -->|No| M[End Stream]

    I --> N[Return Complete Response]
    E --> O[Generate Embeddings]
    O --> P[Return Vectors]
    D --> Q[Return Model Metadata]

    style A fill:#e3f2fd
    style H fill:#e8f5e8
    style I fill:#f3e5f5
    style O fill:#fff3e0
```
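On the wire, the streaming branch follows the OpenAI SSE convention: `data:`-prefixed JSON chunks terminated by `data: [DONE]`. A minimal blocking consumer sketch, reusing the placeholder address and model name from earlier:

```rust
use std::io::{BufRead, BufReader};

use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions") // placeholder
        .json(&json!({
            "model": "gemma-placeholder",
            "messages": [{ "role": "user", "content": "Tell me a story" }],
            "stream": true
        }))
        .send()?;

    // The blocking Response implements Read, so SSE lines can be consumed
    // as they arrive.
    for line in BufReader::new(resp).lines() {
        let line = line?;
        let Some(payload) = line.strip_prefix("data: ") else { continue };
        if payload == "[DONE]" {
            break;
        }
        let chunk: Value = serde_json::from_str(payload)?;
        if let Some(token) = chunk["choices"][0]["delta"]["content"].as_str() {
            print!("{token}");
        }
    }
    Ok(())
}
```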
### Authentication and Security Flow
```mermaid
sequenceDiagram
participant User as User/Client
participant App as Application
participant HF as HuggingFace Hub
participant Model as Model Cache
Note over User, Model: First-time Setup
User->>App: Start application
App->>HF: Check model access (gated)
HF-->>App: 401 Unauthorized
App-->>User: Requires HF authentication
User->>User: huggingface-cli login
User->>App: Retry start
App->>HF: Check model access (with token)
HF-->>App: 200 OK + model metadata
App->>HF: Download model files
HF-->>App: Model data stream
App->>Model: Cache model locally
Note over User, Model: Subsequent Runs
User->>App: Start application
App->>Model: Load cached model
Model-->>App: Ready for inference
```
---
## Summary
The Predict-Otron-9000 architecture provides maximum flexibility through:
- **Monolithic Mode**: Single server embedding all services for development and simple deployments
- **Microservices Mode**: Separate services for production scalability and fault isolation
- **Hybrid Capabilities**: Each service can operate as both library and standalone service
- **Platform Optimization**: Conditional compilation for optimal performance across CPU/GPU configurations
- **OpenAI Compatibility**: Standard API interfaces for seamless integration with existing tools
This flexible architecture allows teams to start with simple monolithic deployments and scale to distributed microservices as needs grow, all while maintaining API compatibility and leveraging platform-specific optimizations.