mirror of https://github.com/geoffsee/predict-otron-9001.git
synced 2025-09-08 22:46:44 +00:00

Create ARCHITECTURE.md - update stale references to old chat crate

This commit is contained in:

README.md (15 lines changed)
@@ -26,7 +26,16 @@ Aliens, in a native executable.
 - **`predict-otron-9000`**: Main unified server that combines both engines
 - **`embeddings-engine`**: Handles text embeddings using FastEmbed with the Nomic Embed Text v1.5 model
 - **`inference-engine`**: Provides text generation capabilities using Gemma models (1B, 2B, 7B, 9B variants) via Candle transformers
-- **`leptos-chat`**: WebAssembly-based chat interface built with Leptos framework for browser-based interaction with the inference engine
+- **`leptos-app`**: WebAssembly-based chat interface built with Leptos framework for browser-based interaction with the inference engine
+
+## Further Reading
+
+### Documentation
+
+- [Architecture](docs/ARCHITECTURE.md) - Architecture diagrams and deployment patterns for the multi-service platform
+- [Server Configuration Guide](docs/SERVER_CONFIG.md) - Detailed server configuration options and deployment modes
+- [Testing Documentation](docs/TESTING.md) - Comprehensive testing guide including unit, integration and e2e tests
+- [Performance Benchmarking](docs/BENCHMARKING.md) - Instructions for running and analyzing performance benchmarks
 
 ## Installation
@@ -262,8 +271,8 @@ The project includes a WebAssembly-based chat interface built with the Leptos fr
 ### Building the Chat Interface
 
 ```shell
-# Navigate to the leptos-chat crate
-cd crates/leptos-chat
+# Navigate to the leptos-app crate
+cd crates/leptos-app
 
 # Build the WebAssembly package
 cargo build --target wasm32-unknown-unknown
docs/ARCHITECTURE.md (new file, 430 lines)

@@ -0,0 +1,430 @@
# Predict-Otron-9000 Architecture Documentation

This document provides comprehensive architectural diagrams for the Predict-Otron-9000 multi-service AI platform, showing all supported configurations and deployment patterns.

## Table of Contents

- [System Overview](#system-overview)
- [Workspace Structure](#workspace-structure)
- [Deployment Configurations](#deployment-configurations)
  - [Development Mode](#development-mode)
  - [Docker Monolithic](#docker-monolithic)
  - [Kubernetes Microservices](#kubernetes-microservices)
- [Service Interactions](#service-interactions)
- [Platform-Specific Configurations](#platform-specific-configurations)
- [Data Flow Patterns](#data-flow-patterns)

## System Overview

The Predict-Otron-9000 is a comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces. The system supports flexible deployment patterns from monolithic to microservices architectures.

```mermaid
graph TB
    subgraph "Core Components"
        A[Main Server<br/>predict-otron-9000]
        B[Inference Engine<br/>Gemma via Candle]
        C[Embeddings Engine<br/>FastEmbed]
        D[Web Frontend<br/>Leptos WASM]
    end

    subgraph "Client Interfaces"
        E[TypeScript CLI<br/>Bun/Node.js]
        F[Web Browser<br/>HTTP/WebSocket]
        G[HTTP API Clients<br/>OpenAI Compatible]
    end

    subgraph "Platform Support"
        H[CPU Fallback<br/>All Platforms]
        I[CUDA Support<br/>Linux GPU]
        J[Metal Support<br/>macOS GPU]
    end

    A --- B
    A --- C
    A --- D
    E -.-> A
    F -.-> A
    G -.-> A
    B --- H
    B --- I
    B --- J
```
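Since every deployment mode exposes the same OpenAI-compatible surface on the main server, a plain HTTP call is a quick way to verify a running instance. This is a minimal sketch assuming the default port 8080 and the standard OpenAI-style paths; the model identifier is a placeholder for whichever Gemma variant the server is configured with.

```shell
# List models exposed by the unified server (assumes it listens on localhost:8080).
curl -s http://localhost:8080/v1/models

# Send a simple chat completion; "gemma-placeholder" stands in for the configured Gemma variant.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-placeholder",
        "messages": [{"role": "user", "content": "Say hello from the predict-otron-9000."}]
      }'
```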

## Workspace Structure

The project uses a 4-crate Rust workspace with TypeScript tooling, designed for maximum flexibility in deployment configurations.

```mermaid
graph TD
    subgraph "Rust Workspace"
        subgraph "Main Orchestrator"
            A[predict-otron-9000<br/>Edition: 2024<br/>Port: 8080]
        end

        subgraph "AI Services"
            B[inference-engine<br/>Edition: 2021<br/>Port: 8080<br/>Candle ML]
            C[embeddings-engine<br/>Edition: 2024<br/>Port: 8080<br/>FastEmbed]
        end

        subgraph "Frontend"
            D[leptos-app<br/>Edition: 2021<br/>Port: 3000/8788<br/>WASM/SSR]
        end
    end

    subgraph "External Tooling"
        E[cli.ts<br/>TypeScript/Bun<br/>OpenAI SDK]
    end

    subgraph "Dependencies"
        A --> B
        A --> C
        A --> D
        B -.-> F[Candle 0.9.1]
        C -.-> G[FastEmbed 4.x]
        D -.-> H[Leptos 0.8.0]
        E -.-> I[OpenAI SDK 5.16+]
    end

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
```

## Deployment Configurations

### Development Mode

Local development runs all services integrated within the main server for simplicity.

```mermaid
graph LR
    subgraph "Development Environment"
        subgraph "Single Process - Port 8080"
            A[predict-otron-9000 Server]
            A --> B[Embedded Inference Engine]
            A --> C[Embedded Embeddings Engine]
            A --> D[SSR Leptos Frontend]
        end

        subgraph "Separate Frontend - Port 8788"
            E[Trunk Dev Server<br/>Hot Reload: 3001]
        end
    end

    subgraph "External Clients"
        F[CLI Client<br/>cli.ts via Bun]
        G[Web Browser]
        H[HTTP API Clients]
    end

    F -.-> A
    G -.-> A
    G -.-> E
    H -.-> A

    style A fill:#e3f2fd
    style E fill:#f1f8e9
```
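A sketch of what running this mode can look like from the workspace root; the exact commands and flags are assumptions based on standard Cargo and Trunk workflows rather than the project's own scripts.

```shell
# Start the unified server (embedded inference, embeddings, and SSR frontend) on port 8080.
cargo run --release -p predict-otron-9000

# In a second terminal: serve the Leptos frontend separately with hot reload via Trunk.
cd crates/leptos-app
trunk serve --port 8788
```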

### Docker Monolithic

Docker Compose runs a single containerized service handling all functionality.

```mermaid
graph TB
    subgraph "Docker Environment"
        subgraph "predict-otron-9000 Container"
            A[Main Server :8080]
            A --> B[Inference Engine<br/>Library Mode]
            A --> C[Embeddings Engine<br/>Library Mode]
            A --> D[Leptos Frontend<br/>SSR Mode]
        end

        subgraph "Persistent Storage"
            E[HF Cache Volume<br/>/.hf-cache]
            F[FastEmbed Cache Volume<br/>/.fastembed_cache]
        end

        subgraph "Network"
            G[predict-otron-network<br/>Bridge Driver]
        end
    end

    subgraph "External Access"
        H[Host Port 8080]
        I[External Clients]
    end

    A --- E
    A --- F
    A --- G
    H --> A
    I -.-> H

    style A fill:#e8f5e8
    style E fill:#fff3e0
    style F fill:#fff3e0
```
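Assuming a Compose file at the repository root that defines the container, network, and cache volumes shown above, day-to-day usage looks roughly like the following sketch (service and volume names are assumptions):

```shell
# Build the image and start the single container; port 8080 is published to the host.
docker compose up --build -d

# The named volumes keep model downloads across container restarts.
docker volume ls

# Tail the server logs, then exercise the API from the host.
docker compose logs -f
curl -s http://localhost:8080/v1/models
```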

### Kubernetes Microservices

Kubernetes deployment separates all services for horizontal scalability and fault isolation.

```mermaid
graph TB
    subgraph "Kubernetes Namespace"
        subgraph "Main Orchestrator"
            A[predict-otron-9000 Pod<br/>:8080<br/>ClusterIP Service]
        end

        subgraph "AI Services"
            B[inference-engine Pod<br/>:8080<br/>ClusterIP Service]
            C[embeddings-engine Pod<br/>:8080<br/>ClusterIP Service]
        end

        subgraph "Frontend"
            D[leptos-app Pod<br/>:8788<br/>ClusterIP Service]
        end

        subgraph "Ingress"
            E[Ingress Controller<br/>predict-otron-9000.local]
        end
    end

    subgraph "External"
        F[External Clients]
        G[Container Registry<br/>ghcr.io/geoffsee/*]
    end

    A <--> B
    A <--> C
    E --> A
    E --> D
    F -.-> E

    G -.-> A
    G -.-> B
    G -.-> C
    G -.-> D

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
```
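For inspecting such a deployment, the following sketch assumes the manifests create Deployments and ClusterIP Services with the names shown in the diagram:

```shell
# Check that all four workloads are up.
kubectl get pods,svc

# Bypass the ingress and talk to the orchestrator directly from a workstation.
kubectl port-forward svc/predict-otron-9000 8080:8080
curl -s http://localhost:8080/v1/models

# With a local cluster, map the ingress host so predict-otron-9000.local resolves.
echo "127.0.0.1 predict-otron-9000.local" | sudo tee -a /etc/hosts
```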

## Service Interactions

### API Flow and Communication Patterns

```mermaid
sequenceDiagram
    participant Client as External Client
    participant Main as Main Server<br/>(Port 8080)
    participant Inf as Inference Engine
    participant Emb as Embeddings Engine
    participant Web as Web Frontend

    Note over Client, Web: Development/Monolithic Mode
    Client->>Main: POST /v1/chat/completions
    Main->>Inf: Internal call (library)
    Inf-->>Main: Generated response
    Main-->>Client: Streaming/Non-streaming response

    Client->>Main: POST /v1/embeddings
    Main->>Emb: Internal call (library)
    Emb-->>Main: Vector embeddings
    Main-->>Client: Embeddings response

    Note over Client, Web: Kubernetes Microservices Mode
    Client->>Main: POST /v1/chat/completions
    Main->>Inf: HTTP POST :8080/v1/chat/completions
    Inf-->>Main: HTTP Response (streaming)
    Main-->>Client: Proxied response

    Client->>Main: POST /v1/embeddings
    Main->>Emb: HTTP POST :8080/v1/embeddings
    Emb-->>Main: HTTP Response
    Main-->>Client: Proxied response

    Note over Client, Web: Web Interface Flow
    Web->>Main: WebSocket connection
    Web->>Main: Chat message
    Main->>Inf: Process inference
    Inf-->>Main: Streaming tokens
    Main-->>Web: WebSocket stream
```
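As a concrete example of the embeddings path, the same request works whether the embeddings engine is embedded in the main server or proxied to a separate pod. A sketch; the exact model identifier the server accepts is an assumption based on the Nomic Embed Text v1.5 model used by the embeddings engine.

```shell
curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nomic-embed-text-v1.5",
        "input": "The quick brown fox jumps over the lazy dog."
      }'
```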

### Port Configuration Matrix

```mermaid
graph TB
    subgraph "Port Allocation by Mode"
        subgraph "Development"
            A[Main Server: 8080<br/>All services embedded]
            B[Leptos Dev: 8788<br/>Hot reload: 3001]
        end

        subgraph "Docker Monolithic"
            C[Main Server: 8080<br/>All services embedded<br/>Host mapped]
        end

        subgraph "Kubernetes Microservices"
            D[Main Server: 8080]
            E[Inference Engine: 8080]
            F[Embeddings Engine: 8080]
            G[Leptos Frontend: 8788]
        end
    end

    style A fill:#e3f2fd
    style C fill:#e8f5e8
    style D fill:#f3e5f5
    style E fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0
```

## Platform-Specific Configurations

### Hardware Acceleration Support

```mermaid
graph TB
    subgraph "Platform Detection"
        A[Build System]
    end

    subgraph "macOS"
        A --> B[Metal Features Available]
        B --> C[CPU Fallback<br/>Stability Priority]
        C --> D[F32 Precision<br/>Gemma Compatibility]
    end

    subgraph "Linux"
        A --> E[CUDA Features Available]
        E --> F[GPU Acceleration<br/>Performance Priority]
        F --> G[BF16 Precision<br/>GPU Optimized]
        E --> H[CPU Fallback<br/>F32 Precision]
    end

    subgraph "Other Platforms"
        A --> I[CPU Only<br/>Universal Compatibility]
        I --> J[F32 Precision<br/>Standard Support]
    end

    style B fill:#e8f5e8
    style E fill:#e3f2fd
    style I fill:#fff3e0
```
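How the acceleration backends are selected at build time comes down to Cargo features and conditional compilation. The sketch below assumes feature names following Candle's `cuda`/`metal` conventions, which may not match this workspace exactly.

```shell
# CPU-only build: works on every platform (F32 precision).
cargo build --release -p inference-engine

# Linux with an NVIDIA GPU: enable CUDA kernels (feature name assumed).
cargo build --release -p inference-engine --features cuda

# macOS: Metal can be enabled, though the diagram above prefers the CPU fallback for stability.
cargo build --release -p inference-engine --features metal
```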

### Model Loading and Caching

```mermaid
graph LR
    subgraph "Model Access Flow"
        A[Application Start] --> B{Model Cache Exists?}
        B -->|Yes| C[Load from Cache]
        B -->|No| D[HuggingFace Authentication]
        D --> E{HF Token Valid?}
        E -->|Yes| F[Download Model]
        E -->|No| G[Authentication Error]
        F --> H[Save to Cache]
        H --> C
        C --> I[Initialize Inference]
    end

    subgraph "Cache Locations"
        J[HF_HOME Cache<br/>.hf-cache]
        K[FastEmbed Cache<br/>.fastembed_cache]
    end

    F -.-> J
    F -.-> K

    style D fill:#fce4ec
    style G fill:#ffebee
```
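In practice, the first run needs Hugging Face credentials for the gated Gemma weights, and the Hugging Face cache location can be pinned with an environment variable. A sketch, assuming the repository-local cache directories shown above:

```shell
# Authenticate once so gated Gemma models can be downloaded on first start.
huggingface-cli login

# Optionally pin the Hugging Face cache to the repo-local directory used by the Docker setup.
export HF_HOME=./.hf-cache

# The embeddings engine maintains its own FastEmbed cache (./.fastembed_cache above).
```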

## Data Flow Patterns

### Request Processing Pipeline

```mermaid
flowchart TD
    A[Client Request] --> B{Request Type}

    B -->|Chat Completion| C[Parse Messages]
    B -->|Model List| D[Return Available Models]
    B -->|Embeddings| E[Process Text Input]

    C --> F[Apply Prompt Template]
    F --> G{Streaming?}

    G -->|Yes| H[Initialize Stream]
    G -->|No| I[Generate Complete Response]

    H --> J[Token Generation Loop]
    J --> K[Send Chunk]
    K --> L{More Tokens?}
    L -->|Yes| J
    L -->|No| M[End Stream]

    I --> N[Return Complete Response]

    E --> O[Generate Embeddings]
    O --> P[Return Vectors]

    D --> Q[Return Model Metadata]

    style A fill:#e3f2fd
    style H fill:#e8f5e8
    style I fill:#f3e5f5
    style O fill:#fff3e0
```
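If the server follows the usual OpenAI streaming convention (chunks delivered as server-sent events), the streaming branch of this pipeline can be exercised with a request like the sketch below; the model name is again a placeholder.

```shell
# -N disables curl buffering so "data: {...}" chunks print as they arrive.
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-placeholder",
        "stream": true,
        "messages": [{"role": "user", "content": "Stream a short haiku."}]
      }'
```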

### Authentication and Security Flow

```mermaid
sequenceDiagram
    participant User as User/Client
    participant App as Application
    participant HF as HuggingFace Hub
    participant Model as Model Cache

    Note over User, Model: First-time Setup
    User->>App: Start application
    App->>HF: Check model access (gated)
    HF-->>App: 401 Unauthorized
    App-->>User: Requires HF authentication

    User->>User: huggingface-cli login
    User->>App: Retry start
    App->>HF: Check model access (with token)
    HF-->>App: 200 OK + model metadata
    App->>HF: Download model files
    HF-->>App: Model data stream
    App->>Model: Cache model locally

    Note over User, Model: Subsequent Runs
    User->>App: Start application
    App->>Model: Load cached model
    Model-->>App: Ready for inference
```

---

## Summary

The Predict-Otron-9000 architecture provides maximum flexibility through:

- **Monolithic Mode**: Single server embedding all services for development and simple deployments
- **Microservices Mode**: Separate services for production scalability and fault isolation
- **Hybrid Capabilities**: Each service can operate as both library and standalone service
- **Platform Optimization**: Conditional compilation for optimal performance across CPU/GPU configurations
- **OpenAI Compatibility**: Standard API interfaces for seamless integration with existing tools

This flexible architecture allows teams to start with simple monolithic deployments and scale to distributed microservices as needs grow, all while maintaining API compatibility and leveraging platform-specific optimizations.
@@ -137,7 +137,7 @@ Parsing workspace at: ..
 Output directory: ../generated-helm-chart
 Chart name: predict-otron-9000
 Found 4 services:
-- leptos-chat: ghcr.io/geoffsee/leptos-chat:latest (port 8788)
+- leptos-app: ghcr.io/geoffsee/leptos-app:latest (port 8788)
 - inference-engine: ghcr.io/geoffsee/inference-service:latest (port 8080)
 - embeddings-engine: ghcr.io/geoffsee/embeddings-service:latest (port 8080)
 - predict-otron-9000: ghcr.io/geoffsee/predict-otron-9000:latest (port 8080)