predict-otron-9000

Warning: Do NOT use this in production unless you are cool like that.

Aliens, in a native executable.

Features

  • OpenAI Compatible: API endpoints match OpenAI's format for easy integration
  • Text Embeddings: Generate high-quality text embeddings using the Nomic Embed Text v1.5 model
  • Text Generation: Chat completions with an OpenAI-compatible API using Gemma models (1B, 2B, 7B, and 9B variants, in both base and instruction-tuned forms)
  • Performance Optimized: Implements efficient caching and singleton patterns for improved throughput and reduced latency
  • Performance Benchmarking: Includes tools for measuring performance and generating HTML reports
  • Web Chat Interface: A Leptos-based WebAssembly (WASM) chat interface for browser-based interaction with the inference engine

Architecture

Core Components

  • predict-otron-9000: Main unified server that combines both engines
  • embeddings-engine: Handles text embeddings using FastEmbed with the Nomic Embed Text v1.5 model
  • inference-engine: Provides text generation capabilities using Gemma models (1B, 2B, 7B, 9B variants) via Candle transformers
  • leptos-chat: WebAssembly-based chat interface built with Leptos framework for browser-based interaction with the inference engine

Installation

Prerequisites

  • Rust 1.85 or newer (required for 2024 edition support)
  • Cargo package manager
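
To verify the installed toolchain meets the requirement:

rustc --version
cargo --version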

Build from Source

# 1. Clone the repository
git clone <repository-url>
cd predict-otron-9000

# 2. Build the project
cargo build --release

# 3. Run the unified server
./run_server.sh

# Alternative: Build and run individual components
# For inference engine only:
cargo run -p inference-engine --release -- --server --port 3777
# For embeddings engine only:
cargo run -p embeddings-engine --release

Usage

Starting the Server

The server can be started using the provided script or directly with cargo:

# Using the provided script
./run_server.sh

# Or directly with cargo
cargo run --bin predict-otron-9000

Configuration

Environment variables for server configuration:

  • SERVER_HOST: Server bind address (default: 0.0.0.0)
  • SERVER_PORT: Server port (default: 8080)
  • SERVER_CONFIG: JSON configuration for deployment mode (default: Local mode)
  • RUST_LOG: Logging level configuration

Deployment Modes

The server supports two deployment modes controlled by SERVER_CONFIG:

Local Mode (default): Runs inference and embeddings services locally

./run_server.sh
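
Local mode can also be selected explicitly. Assuming the same serverMode key shown below for HighAvailability, this is equivalent to the default:

export SERVER_CONFIG='{"serverMode": "Local"}'
./run_server.sh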

HighAvailability Mode: Proxies requests to external services

export SERVER_CONFIG='{"serverMode": "HighAvailability"}'
./run_server.sh

See docs/SERVER_CONFIG.md for complete configuration options, Docker Compose, and Kubernetes examples.

Basic Configuration Example:

export SERVER_PORT=3000
export RUST_LOG=debug
./run_server.sh

API Endpoints

Text Embeddings

Generate text embeddings compatible with OpenAI's embeddings API.

Endpoint: POST /v1/embeddings

Request Body:

{
  "input": "Your text to embed",
  "model": "nomic-embed-text-v1.5"
}

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.1, 0.2, 0.3]
    }
  ],
  "model": "nomic-embed-text-v1.5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
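
Example request with curl, assuming the server is running on the default port:

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Your text to embed",
    "model": "nomic-embed-text-v1.5"
  }'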

Chat Completions

Generate chat completions with an OpenAI-compatible request format. The current implementation is simplified (see Limitations); ready-to-run commands appear under Quick cURL Verification for Chat Endpoints below.

Endpoint: POST /v1/chat/completions

Request Body:

{
  "model": "gemma-2b-it",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ]
}

Response:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1699123456,
  "model": "gemma-2b-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! This is the unified predict-otron-9000 server..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 35,
    "total_tokens": 45
  }
}

Health Check

Endpoint: GET /

Returns a simple "Hello, World!" message to verify the server is running.
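
A quick check from the command line, assuming the default port:

curl http://localhost:8080/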

Development

Project Structure

predict-otron-9000/
├── Cargo.toml                 # Workspace configuration
├── README.md                  # This file
├── run_server.sh              # Server startup script
└── crates/
    ├── predict-otron-9000/   # Main unified server
    │   ├── Cargo.toml
    │   └── src/
    │       └── main.rs
    ├── embeddings-engine/    # Text embeddings functionality
    │   ├── Cargo.toml
    │   └── src/
    │       ├── lib.rs
    │       └── main.rs
    └── inference-engine/     # Text generation functionality
        ├── Cargo.toml
        ├── src/
        │   ├── lib.rs
        │   ├── cli.rs
        │   ├── server.rs
        │   ├── model.rs
        │   ├── text_generation.rs
        │   ├── token_output_stream.rs
        │   ├── utilities_lib.rs
        │   └── openai_types.rs
        └── tests/

Running Tests

# Run all tests
cargo test

# Run tests for a specific crate
cargo test -p embeddings-engine
cargo test -p inference-engine
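
# Show test output (println!/log lines) while debugging (standard cargo flag)
cargo test -- --nocapture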

For comprehensive testing documentation, including unit tests, integration tests, end-to-end tests, and performance testing, please refer to the TESTING.md document.

For performance benchmarking with HTML report generation, see the BENCHMARKING.md guide.

Adding Features

  1. Embeddings Engine: Modify crates/embeddings-engine/src/lib.rs to add new embedding models or functionality
  2. Inference Engine: The inference engine is modular; add new models in crates/inference-engine/src/model.rs
  3. Unified Server: Update crates/predict-otron-9000/src/main.rs to integrate new capabilities

Logging and Debugging

The application uses structured logging with tracing. Log levels can be controlled via the RUST_LOG environment variable:

# Debug level logging
export RUST_LOG=debug

# Trace level for detailed embeddings debugging
export RUST_LOG=trace

# Module-specific logging
export RUST_LOG=predict_otron_9000=debug,embeddings_engine=trace

Chat Interface

The project includes a WebAssembly-based chat interface built with the Leptos framework.

Building the Chat Interface

# Navigate to the leptos-chat crate
cd crates/leptos-chat

# Build the WebAssembly package
cargo build --target wasm32-unknown-unknown

# For development with trunk (if installed)
trunk serve
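
# For a deployable static site (see Limitations); trunk writes output to dist/ by default
trunk build --release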

Usage

The chat interface connects to the inference engine API and provides a user-friendly way to interact with the AI models. To use:

  1. Start the predict-otron-9000 server
  2. Open the chat interface in a web browser
  3. Enter messages and receive AI-generated responses

The interface supports:

  • Real-time messaging with the AI
  • Visual indication of when the AI is generating a response
  • Message history display

Limitations

  • Inference Engine: Currently provides a simplified implementation for chat completions. Full model loading and text generation capabilities from the inference-engine crate are not yet integrated into the unified server.
  • Model Support: Embeddings are limited to the Nomic Embed Text v1.5 model.
  • Scalability: Single-threaded model loading may impact performance under heavy load.
  • Chat Interface: The WebAssembly chat interface requires compilation to a static site before deployment.

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Ensure all tests pass: cargo test
  5. Submit a pull request

Quick cURL Verification for Chat Endpoints

Start the unified server:

./run_server.sh

Non-streaming chat completion (expects JSON response):

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [
      {"role": "user", "content": "Who was the 16th president of the United States?"}
    ],
    "max_tokens": 128,
    "stream": false
  }'

Streaming chat completion via Server-Sent Events (SSE):

curl -N -X POST http://localhost:8080/v1/chat/completions/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [
      {"role": "user", "content": "Who was the 16th president of the United States?"}
    ],
    "max_tokens": 128,
    "stream": true
  }'

Helper scripts are also available:

  • scripts/curl_chat.sh
  • scripts/curl_chat_stream.sh