
Server Configuration Guide

The predict-otron-9000 server supports two deployment modes controlled by the SERVER_CONFIG environment variable:

  1. Standalone Mode (default): Runs inference and embeddings services locally within the main server process
  2. HighAvailability Mode: Proxies requests to external inference and embeddings services

Configuration Format

The SERVER_CONFIG environment variable accepts a JSON configuration with the following structure:

{
  "serverMode": "Standalone"
}

or

{
  "serverMode": "HighAvailability",
  "services": {
    "inference_url": "http://inference-service:8080",
    "embeddings_url": "http://embeddings-service:8080"
  }
}

Fields:

  • serverMode: Either "Standalone" or "HighAvailability"
  • services: Optional object containing service URLs (uses defaults if not provided)

Standalone Mode (Default)

If SERVER_CONFIG is not set or contains invalid JSON, the server defaults to Standalone mode.

Example: Explicit Standalone Mode

export SERVER_CONFIG='{"serverMode": "Standalone"}'
./run_server.sh

In Standalone mode:

  • Inference requests are handled by the embedded inference engine
  • Embeddings requests are handled by the embedded embeddings engine
  • No external services are required
  • Supports all existing functionality without changes
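
A quick way to confirm that the embedded engines are serving traffic is to probe the health and models endpoints once the server is up (a minimal sketch; it assumes the server listens on localhost:8080):

# Start in Standalone mode, then probe the embedded services
export SERVER_CONFIG='{"serverMode": "Standalone"}'
./run_server.sh &

# Both endpoints are served by the same process
curl http://localhost:8080/health
curl http://localhost:8080/v1/models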

HighAvailability Mode

In HighAvailability mode, the server acts as a proxy, forwarding requests to external services.

Example: Basic HighAvailability Mode

export SERVER_CONFIG='{"serverMode": "HighAvailability"}'
./run_server.sh

This uses the default service URLs:

  • Inference service: http://inference-service:8080
  • Embeddings service: http://embeddings-service:8080
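
Before starting the proxy it can help to confirm that these hostnames resolve and respond from the environment where predict-otron-9000 will run (a sketch; it assumes the external services also expose a /health endpoint, which may not hold for your deployment):

curl http://inference-service:8080/health
curl http://embeddings-service:8080/health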

Example: Custom Service URLs

export SERVER_CONFIG='{
  "serverMode": "HighAvailability",
  "services": {
    "inference_url": "http://custom-inference:9000",
    "embeddings_url": "http://custom-embeddings:9001"
  }
}'
./run_server.sh

Docker Compose Example

version: '3.8'
services:
  # Inference service
  inference-service:
    image: ghcr.io/geoffsee/inference-service:latest
    ports:
      - "8081:8080"
    environment:
      - RUST_LOG=info

  # Embeddings service  
  embeddings-service:
    image: ghcr.io/geoffsee/embeddings-service:latest
    ports:
      - "8082:8080"
    environment:
      - RUST_LOG=info

  # Main proxy server
  predict-otron-9000:
    image: ghcr.io/geoffsee/predict-otron-9000:latest
    ports:
      - "8080:8080"
    environment:
      - RUST_LOG=info
      - SERVER_CONFIG={"serverMode":"HighAvailability","services":{"inference_url":"http://inference-service:8080","embeddings_url":"http://embeddings-service:8080"}}
    depends_on:
      - inference-service
      - embeddings-service
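
With the file above saved as docker-compose.yml, a typical bring-up and smoke test might look like this (a sketch; service names and the health endpoint follow the example above):

docker compose up -d

# The proxy should log the HighAvailability banner shown in the Logging section
docker compose logs predict-otron-9000 | grep "Running in"

# End-to-end check through the proxy
curl http://localhost:8080/health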

Kubernetes Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: server-config
data:
  SERVER_CONFIG: |
    {
      "serverMode": "HighAvailability",
      "services": {
        "inference_url": "http://inference-service:8080",
        "embeddings_url": "http://embeddings-service:8080"
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: predict-otron-9000
spec:
  replicas: 3
  selector:
    matchLabels:
      app: predict-otron-9000
  template:
    metadata:
      labels:
        app: predict-otron-9000
    spec:
      containers:
      - name: predict-otron-9000
        image: ghcr.io/geoffsee/predict-otron-9000:latest
        ports:
        - containerPort: 8080
        env:
        - name: RUST_LOG
          value: "info"
        - name: SERVER_CONFIG
          valueFrom:
            configMapKeyRef:
              name: server-config
              key: SERVER_CONFIG
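
Applying the manifests and checking the startup logs confirms which mode the replicas selected (a sketch assuming both manifests above are saved in server-config.yaml):

kubectl apply -f server-config.yaml
kubectl rollout status deployment/predict-otron-9000
kubectl logs deployment/predict-otron-9000 | grep "Running in"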

API Compatibility

Both modes expose the same OpenAI-compatible API endpoints:

  • POST /v1/chat/completions - Chat completions (streaming and non-streaming)
  • GET /v1/models - List available models
  • POST /v1/embeddings - Generate text embeddings
  • GET /health - Health check
  • GET / - Root endpoint
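
The requests below exercise the main endpoints and behave identically in either mode (a sketch; "your-model" is a placeholder for a model name returned by /v1/models, and localhost:8080 is assumed):

# List the available models
curl http://localhost:8080/v1/models

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello"}]}'

# Embeddings
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "input": "Hello"}'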

Logging

The server logs the selected mode on startup:

Standalone Mode:

INFO predict_otron_9000: Running in Standalone mode

HighAvailability Mode:

INFO predict_otron_9000: Running in HighAvailability mode - proxying to external services
INFO predict_otron_9000: Inference service URL: http://inference-service:8080
INFO predict_otron_9000: Embeddings service URL: http://embeddings-service:8080
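
To confirm the active mode without reading the full startup output, filtering the logs for this line is enough (a sketch; adjust if your logs go to a file or a collector):

./run_server.sh 2>&1 | grep "Running in"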

Error Handling

  • Invalid JSON in SERVER_CONFIG falls back to Standalone mode with a warning
  • Missing SERVER_CONFIG defaults to Standalone mode
  • Network errors to external services return HTTP 502 (Bad Gateway)
  • Request/response proxying preserves original HTTP status codes and headers

Performance Considerations

Standalone Mode:

  • Lower latency (no network overhead)
  • Higher memory usage (models loaded locally)
  • Single point of failure

HighAvailability Mode:

  • Higher latency (network requests)
  • Lower memory usage (no local models)
  • Horizontal scaling possible
  • Network reliability dependent
  • 5-minute timeout for long-running inference requests

Troubleshooting

  1. Configuration not applied: Check the JSON syntax (see the snippet after this list) and restart the server
  2. External services unreachable: Verify service URLs and network connectivity
  3. Timeouts: Check if inference requests exceed the 5-minute timeout limit
  4. 502 errors: External services may be down or misconfigured
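
For the JSON syntax check in step 1, piping the variable through jq surfaces malformed configuration before it silently falls back to Standalone mode (a sketch; requires jq to be installed):

echo "$SERVER_CONFIG" | jq .   # prints the parsed config, or an error pointing at the offending position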

Migration

To migrate from Standalone to HighAvailability mode:

  1. Deploy separate inference and embeddings services
  2. Update SERVER_CONFIG to point to the new services
  3. Restart the predict-otron-9000 server
  4. Verify endpoints are working with test requests (see the check below)
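
For step 4, a one-line check that the proxy returns 200 rather than 502 after the switch is often enough (a sketch; assumes the server is reachable on localhost:8080):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/v1/models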

The API contract remains identical, so the migration can be completed with zero downtime.