From 0488bddfdb8d9698cd5655469b7a375438f92a13 Mon Sep 17 00:00:00 2001
From: geoffsee <>
Date: Thu, 28 Aug 2025 12:22:05 -0400
Subject: [PATCH] Create ARCHITECTURE.md

- update stale references to old chat crate
---
 README.md                 |  15 +-
 docs/ARCHITECTURE.md      | 430 ++++++++++++++++++++++++++++++++++++++
 helm-chart-tool/README.md |   2 +-
 3 files changed, 443 insertions(+), 4 deletions(-)
 create mode 100644 docs/ARCHITECTURE.md

diff --git a/README.md b/README.md
index 39583d2..ab76b80 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,16 @@ Aliens, in a native executable.
 - **`predict-otron-9000`**: Main unified server that combines both engines
 - **`embeddings-engine`**: Handles text embeddings using FastEmbed with the Nomic Embed Text v1.5 model
 - **`inference-engine`**: Provides text generation capabilities using Gemma models (1B, 2B, 7B, 9B variants) via Candle transformers
-- **`leptos-chat`**: WebAssembly-based chat interface built with Leptos framework for browser-based interaction with the inference engine
+- **`leptos-app`**: WebAssembly-based chat interface built with Leptos framework for browser-based interaction with the inference engine
+
+## Further Reading
+
+### Documentation
+
+- [Architecture](docs/ARCHITECTURE.md) - Architecture diagrams and supported deployment configurations
+- [Server Configuration Guide](docs/SERVER_CONFIG.md) - Detailed server configuration options and deployment modes
+- [Testing Documentation](docs/TESTING.md) - Comprehensive testing guide including unit, integration and e2e tests
+- [Performance Benchmarking](docs/BENCHMARKING.md) - Instructions for running and analyzing performance benchmarks
 
 ## Installation
 
@@ -262,8 +271,8 @@ The project includes a WebAssembly-based chat interface built with the Leptos fr
 ### Building the Chat Interface
 
 ```shell
-# Navigate to the leptos-chat crate
-cd crates/leptos-chat
+# Navigate to the leptos-app crate
+cd crates/leptos-app
 
 # Build the WebAssembly package
 cargo build --target wasm32-unknown-unknown
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..4979b4b
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,430 @@
+# Predict-Otron-9000 Architecture Documentation
+
+This document provides comprehensive architectural diagrams for the Predict-Otron-9000 multi-service AI platform, showing all supported configurations and deployment patterns.
+
+## Table of Contents
+
+- [System Overview](#system-overview)
+- [Workspace Structure](#workspace-structure)
+- [Deployment Configurations](#deployment-configurations)
+  - [Development Mode](#development-mode)
+  - [Docker Monolithic](#docker-monolithic)
+  - [Kubernetes Microservices](#kubernetes-microservices)
+- [Service Interactions](#service-interactions)
+- [Platform-Specific Configurations](#platform-specific-configurations)
+- [Data Flow Patterns](#data-flow-patterns)
+
+## System Overview
+
+The Predict-Otron-9000 is a comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces. The system supports flexible deployment patterns from monolithic to microservices architectures.
+
+```mermaid
+graph TB
+    subgraph "Core Components"
+        A[Main Server<br/>predict-otron-9000]
+        B[Inference Engine<br/>Gemma via Candle]
+        C[Embeddings Engine<br/>FastEmbed]
+        D[Web Frontend<br/>Leptos WASM]
+    end
+
+    subgraph "Client Interfaces"
+        E[TypeScript CLI<br/>Bun/Node.js]
+        F[Web Browser<br/>HTTP/WebSocket]
+        G[HTTP API Clients<br/>OpenAI Compatible]
+    end
+
+    subgraph "Platform Support"
+        H[CPU Fallback<br/>All Platforms]
+        I[CUDA Support<br/>Linux GPU]
+        J[Metal Support<br/>macOS GPU]
+    end
+
+    A --- B
+    A --- C
+    A --- D
+    E -.-> A
+    F -.-> A
+    G -.-> A
+    B --- H
+    B --- I
+    B --- J
+```
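+
+Because every client interface in the diagram above speaks the same OpenAI-compatible HTTP API, a standard SDK is enough to talk to the platform. The snippet below is a minimal connectivity sketch, assuming the main server is running locally on its default port 8080; the API key is a placeholder, since local inference performs no key validation.
+
+```typescript
+// List the models the local server reports (placeholder API key, local server assumed).
+import OpenAI from "openai";
+
+const client = new OpenAI({
+  baseURL: "http://localhost:8080/v1",
+  apiKey: "not-needed-locally",
+});
+
+const models = await client.models.list();
+for (const model of models.data) {
+  console.log(model.id);
+}
+```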
+
+## Workspace Structure
+
+The project uses a 4-crate Rust workspace with TypeScript tooling, designed for maximum flexibility in deployment configurations.
+
+```mermaid
+graph TD
+    subgraph "Rust Workspace"
+        subgraph "Main Orchestrator"
+            A[predict-otron-9000<br/>Edition: 2024<br/>Port: 8080]
+        end
+
+        subgraph "AI Services"
+            B[inference-engine<br/>Edition: 2021<br/>Port: 8080<br/>Candle ML]
+            C[embeddings-engine<br/>Edition: 2024<br/>Port: 8080<br/>FastEmbed]
+        end
+
+        subgraph "Frontend"
+            D[leptos-app<br/>Edition: 2021<br/>Port: 3000/8788<br/>WASM/SSR]
+        end
+    end
+
+    subgraph "External Tooling"
+        E[cli.ts<br/>TypeScript/Bun<br/>OpenAI SDK]
+    end
+
+    subgraph "Dependencies"
+        A --> B
+        A --> C
+        A --> D
+        B -.-> F[Candle 0.9.1]
+        C -.-> G[FastEmbed 4.x]
+        D -.-> H[Leptos 0.8.0]
+        E -.-> I[OpenAI SDK 5.16+]
+    end
+
+    style A fill:#e1f5fe
+    style B fill:#f3e5f5
+    style C fill:#e8f5e8
+    style D fill:#fff3e0
+    style E fill:#fce4ec
+```
+
+## Deployment Configurations
+
+### Development Mode
+
+Local development runs all services integrated within the main server for simplicity.
+
+```mermaid
+graph LR
+    subgraph "Development Environment"
+        subgraph "Single Process - Port 8080"
+            A[predict-otron-9000 Server]
+            A --> B[Embedded Inference Engine]
+            A --> C[Embedded Embeddings Engine]
+            A --> D[SSR Leptos Frontend]
+        end
+
+        subgraph "Separate Frontend - Port 8788"
+            E[Trunk Dev Server<br/>Hot Reload: 3001]
+        end
+    end
+
+    subgraph "External Clients"
+        F[CLI Client<br/>cli.ts via Bun]
+        G[Web Browser]
+        H[HTTP API Clients]
+    end
+
+    F -.-> A
+    G -.-> A
+    G -.-> E
+    H -.-> A
+
+    style A fill:#e3f2fd
+    style E fill:#f1f8e9
+```
+
+### Docker Monolithic
+
+Docker Compose runs a single containerized service handling all functionality.
+
+```mermaid
+graph TB
+    subgraph "Docker Environment"
+        subgraph "predict-otron-9000 Container"
+            A[Main Server :8080]
+            A --> B[Inference Engine<br/>Library Mode]
+            A --> C[Embeddings Engine<br/>Library Mode]
+            A --> D[Leptos Frontend<br/>SSR Mode]
+        end
+
+        subgraph "Persistent Storage"
+            E[HF Cache Volume<br/>/.hf-cache]
+            F[FastEmbed Cache Volume<br/>/.fastembed_cache]
+        end
+
+        subgraph "Network"
+            G[predict-otron-network<br/>Bridge Driver]
+        end
+    end
+
+    subgraph "External Access"
+        H[Host Port 8080]
+        I[External Clients]
+    end
+
+    A --- E
+    A --- F
+    A --- G
+    H --> A
+    I -.-> H
+
+    style A fill:#e8f5e8
+    style E fill:#fff3e0
+    style F fill:#fff3e0
+```
+
+### Kubernetes Microservices
+
+Kubernetes deployment separates all services for horizontal scalability and fault isolation.
+
+```mermaid
+graph TB
+    subgraph "Kubernetes Namespace"
+        subgraph "Main Orchestrator"
+            A[predict-otron-9000 Pod<br/>:8080<br/>ClusterIP Service]
+        end
+
+        subgraph "AI Services"
+            B[inference-engine Pod<br/>:8080<br/>ClusterIP Service]
+            C[embeddings-engine Pod<br/>:8080<br/>ClusterIP Service]
+        end
+
+        subgraph "Frontend"
+            D[leptos-app Pod<br/>:8788<br/>ClusterIP Service]
+        end
+
+        subgraph "Ingress"
+            E[Ingress Controller<br/>predict-otron-9000.local]
+        end
+    end
+
+    subgraph "External"
+        F[External Clients]
+        G[Container Registry<br/>ghcr.io/geoffsee/*]
+    end
+
+    A <--> B
+    A <--> C
+    E --> A
+    E --> D
+    F -.-> E
+
+    G -.-> A
+    G -.-> B
+    G -.-> C
+    G -.-> D
+
+    style A fill:#e3f2fd
+    style B fill:#f3e5f5
+    style C fill:#e8f5e8
+    style D fill:#fff3e0
+    style E fill:#fce4ec
+```
+
+## Service Interactions
+
+### API Flow and Communication Patterns
+
+```mermaid
+sequenceDiagram
+    participant Client as External Client
+    participant Main as Main Server<br/>(Port 8080)
+    participant Inf as Inference Engine
+    participant Emb as Embeddings Engine
+    participant Web as Web Frontend
+
+    Note over Client, Web: Development/Monolithic Mode
+    Client->>Main: POST /v1/chat/completions
+    Main->>Inf: Internal call (library)
+    Inf-->>Main: Generated response
+    Main-->>Client: Streaming/Non-streaming response
+
+    Client->>Main: POST /v1/embeddings
+    Main->>Emb: Internal call (library)
+    Emb-->>Main: Vector embeddings
+    Main-->>Client: Embeddings response
+
+    Note over Client, Web: Kubernetes Microservices Mode
+    Client->>Main: POST /v1/chat/completions
+    Main->>Inf: HTTP POST :8080/v1/chat/completions
+    Inf-->>Main: HTTP Response (streaming)
+    Main-->>Client: Proxied response
+
+    Client->>Main: POST /v1/embeddings
+    Main->>Emb: HTTP POST :8080/v1/embeddings
+    Emb-->>Main: HTTP Response
+    Main-->>Client: Proxied response
+
+    Note over Client, Web: Web Interface Flow
+    Web->>Main: WebSocket connection
+    Web->>Main: Chat message
+    Main->>Inf: Process inference
+    Inf-->>Main: Streaming tokens
+    Main-->>Web: WebSocket stream
+```
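+
+The client-facing contract stays identical whether the engines are embedded or proxied, so the TypeScript CLI (or any OpenAI SDK consumer) does not change between modes. The sketch below exercises the two flows from the diagram against a local server on port 8080; the model identifiers are illustrative placeholders, so substitute whatever `/v1/models` reports for your deployment.
+
+```typescript
+import OpenAI from "openai";
+
+const client = new OpenAI({
+  baseURL: "http://localhost:8080/v1",
+  apiKey: "not-needed-locally", // placeholder; the local server does not validate keys
+});
+
+// Chat completion, streamed token by token (the streaming branch above).
+const stream = await client.chat.completions.create({
+  model: "gemma-2b-it", // placeholder model id
+  messages: [{ role: "user", content: "Summarize the architecture in one sentence." }],
+  stream: true,
+});
+for await (const chunk of stream) {
+  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
+}
+
+// Embeddings request (the /v1/embeddings flow above).
+const embedding = await client.embeddings.create({
+  model: "nomic-embed-text-v1.5", // placeholder model id
+  input: "predict-otron-9000",
+});
+console.log(`\n${embedding.data[0].embedding.length} dimensions`);
+```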
+
+### Port Configuration Matrix
+
+```mermaid
+graph TB
+    subgraph "Port Allocation by Mode"
+        subgraph "Development"
+            A[Main Server: 8080<br/>All services embedded]
+            B[Leptos Dev: 8788<br/>Hot reload: 3001]
+        end
+
+        subgraph "Docker Monolithic"
+            C[Main Server: 8080<br/>All services embedded<br/>Host mapped]
+        end
+
+        subgraph "Kubernetes Microservices"
+            D[Main Server: 8080]
+            E[Inference Engine: 8080]
+            F[Embeddings Engine: 8080]
+            G[Leptos Frontend: 8788]
+        end
+    end
+
+    style A fill:#e3f2fd
+    style C fill:#e8f5e8
+    style D fill:#f3e5f5
+    style E fill:#f3e5f5
+    style F fill:#e8f5e8
+    style G fill:#fff3e0
+```
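+
+For client code, the practical consequence of this matrix is just the base URL. The helper below is an illustrative sketch only; the Kubernetes hostname is an assumption that depends on the ingress host configured above.
+
+```typescript
+// Map deployment modes to the base URL an OpenAI-compatible client would use.
+type DeploymentMode = "development" | "docker" | "kubernetes";
+
+const API_BASE_URL: Record<DeploymentMode, string> = {
+  development: "http://localhost:8080/v1",
+  docker: "http://localhost:8080/v1", // container port mapped to the host
+  kubernetes: "http://predict-otron-9000.local/v1", // via the ingress host (assumed DNS entry)
+};
+
+console.log(API_BASE_URL.development);
+```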
+
+## Platform-Specific Configurations
+
+### Hardware Acceleration Support
+
+```mermaid
+graph TB
+    subgraph "Platform Detection"
+        A[Build System]
+    end
+
+    subgraph "macOS"
+        A --> B[Metal Features Available]
+        B --> C[CPU Fallback<br/>Stability Priority]
+        C --> D[F32 Precision<br/>Gemma Compatibility]
+    end
+
+    subgraph "Linux"
+        A --> E[CUDA Features Available]
+        E --> F[GPU Acceleration<br/>Performance Priority]
+        F --> G[BF16 Precision<br/>GPU Optimized]
+        E --> H[CPU Fallback<br/>F32 Precision]
+    end
+
+    subgraph "Other Platforms"
+        A --> I[CPU Only<br/>Universal Compatibility]
+        I --> J[F32 Precision<br/>Standard Support]
+    end
+
+    style B fill:#e8f5e8
+    style E fill:#e3f2fd
+    style I fill:#fff3e0
+```
+
+### Model Loading and Caching
+
+```mermaid
+graph LR
+    subgraph "Model Access Flow"
+        A[Application Start] --> B{Model Cache Exists?}
+        B -->|Yes| C[Load from Cache]
+        B -->|No| D[HuggingFace Authentication]
+        D --> E{HF Token Valid?}
+        E -->|Yes| F[Download Model]
+        E -->|No| G[Authentication Error]
+        F --> H[Save to Cache]
+        H --> C
+        C --> I[Initialize Inference]
+    end
+
+    subgraph "Cache Locations"
+        J[HF_HOME Cache<br/>.hf-cache]
+        K[FastEmbed Cache<br/>.fastembed_cache]
+    end
+
+    F -.-> J
+    F -.-> K
+
+    style D fill:#fce4ec
+    style G fill:#ffebee
+```
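+
+A launcher script can distinguish the first-run (download) path from the cached path before starting the server. The sketch below assumes the default cache directory names shown in the diagrams; adjust them if your deployment mounts the volumes elsewhere.
+
+```typescript
+// Pre-flight check: do the model caches already exist?
+import { existsSync } from "node:fs";
+
+const hfCache = process.env.HF_HOME ?? "./.hf-cache";
+const fastembedCache = "./.fastembed_cache";
+
+if (existsSync(hfCache) && existsSync(fastembedCache)) {
+  console.log("Caches found: startup should load models locally.");
+} else {
+  console.log("No caches yet: first start will authenticate (huggingface-cli login) and download models.");
+}
+```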
+
+## Data Flow Patterns
+
+### Request Processing Pipeline
+
+```mermaid
+flowchart TD
+    A[Client Request] --> B{Request Type}
+
+    B -->|Chat Completion| C[Parse Messages]
+    B -->|Model List| D[Return Available Models]
+    B -->|Embeddings| E[Process Text Input]
+
+    C --> F[Apply Prompt Template]
+    F --> G{Streaming?}
+
+    G -->|Yes| H[Initialize Stream]
+    G -->|No| I[Generate Complete Response]
+
+    H --> J[Token Generation Loop]
+    J --> K[Send Chunk]
+    K --> L{More Tokens?}
+    L -->|Yes| J
+    L -->|No| M[End Stream]
+
+    I --> N[Return Complete Response]
+
+    E --> O[Generate Embeddings]
+    O --> P[Return Vectors]
+
+    D --> Q[Return Model Metadata]
+
+    style A fill:#e3f2fd
+    style H fill:#e8f5e8
+    style I fill:#f3e5f5
+    style O fill:#fff3e0
+```
+
+### Authentication and Security Flow
+
+```mermaid
+sequenceDiagram
+    participant User as User/Client
+    participant App as Application
+    participant HF as HuggingFace Hub
+    participant Model as Model Cache
+
+    Note over User, Model: First-time Setup
+    User->>App: Start application
+    App->>HF: Check model access (gated)
+    HF-->>App: 401 Unauthorized
+    App-->>User: Requires HF authentication
+
+    User->>User: huggingface-cli login
+    User->>App: Retry start
+    App->>HF: Check model access (with token)
+    HF-->>App: 200 OK + model metadata
+    App->>HF: Download model files
+    HF-->>App: Model data stream
+    App->>Model: Cache model locally
+
+    Note over User, Model: Subsequent Runs
+    User->>App: Start application
+    App->>Model: Load cached model
+    Model-->>App: Ready for inference
+```
+
+---
+
+## Summary
+
+The Predict-Otron-9000 architecture provides maximum flexibility through:
+
+- **Monolithic Mode**: Single server embedding all services for development and simple deployments
+- **Microservices Mode**: Separate services for production scalability and fault isolation
+- **Hybrid Capabilities**: Each service can operate as both a library and a standalone service
+- **Platform Optimization**: Conditional compilation for optimal performance across CPU/GPU configurations
+- **OpenAI Compatibility**: Standard API interfaces for seamless integration with existing tools
+
+This flexible architecture allows teams to start with simple monolithic deployments and scale to distributed microservices as needs grow, all while maintaining API compatibility and leveraging platform-specific optimizations.
\ No newline at end of file
diff --git a/helm-chart-tool/README.md b/helm-chart-tool/README.md
index 6036328..58ee48d 100644
--- a/helm-chart-tool/README.md
+++ b/helm-chart-tool/README.md
@@ -137,7 +137,7 @@ Parsing workspace at: ..
 Output directory: ../generated-helm-chart
 Chart name: predict-otron-9000
 Found 4 services:
-  - leptos-chat: ghcr.io/geoffsee/leptos-chat:latest (port 8788)
+  - leptos-app: ghcr.io/geoffsee/leptos-app:latest (port 8788)
   - inference-engine: ghcr.io/geoffsee/inference-service:latest (port 8080)
   - embeddings-engine: ghcr.io/geoffsee/embeddings-service:latest (port 8080)
   - predict-otron-9000: ghcr.io/geoffsee/predict-otron-9000:latest (port 8080)