mirror of https://github.com/geoffsee/predict-otron-9001.git, synced 2025-09-08 22:46:44 +00:00
- Introduced ServerConfig for handling deployment modes and services.
- Added HighAvailability mode for proxying requests to external services.
- Maintained Local mode for embedded services.
- Updated `README.md` and included `SERVER_CONFIG.md` for detailed documentation.
docs/SERVER_CONFIG.md (new file, 223 lines)
@@ -0,0 +1,223 @@
# Server Configuration Guide

The predict-otron-9000 server supports two deployment modes controlled by the `SERVER_CONFIG` environment variable:

1. **Local Mode** (default): Runs inference and embeddings services locally within the main server process
2. **HighAvailability Mode**: Proxies requests to external inference and embeddings services

## Configuration Format

The `SERVER_CONFIG` environment variable accepts a JSON configuration with the following structure:
```json
{
  "serverMode": "Local",
  "services": {
    "inference_url": "http://inference-service:8080",
    "embeddings_url": "http://embeddings-service:8080"
  }
}
```

or

```json
{
  "serverMode": "HighAvailability",
  "services": {
    "inference_url": "http://inference-service:8080",
    "embeddings_url": "http://embeddings-service:8080"
  }
}
```
**Fields:**

- `serverMode`: Either `"Local"` or `"HighAvailability"`
- `services`: Optional object containing service URLs (uses defaults if not provided)
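Because invalid JSON makes the server fall back to Local mode with only a warning (see Error Handling below), it can pay to validate the value before starting the server. A minimal sketch using `jq`, assuming it is installed:

```bash
# jq -e exits non-zero if SERVER_CONFIG is not valid JSON,
# which would otherwise trigger a quiet fallback to Local mode.
export SERVER_CONFIG='{"serverMode": "HighAvailability"}'
if echo "$SERVER_CONFIG" | jq -e . > /dev/null; then
  ./run_server.sh
else
  echo "SERVER_CONFIG is not valid JSON" >&2
fi
```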
## Local Mode (Default)

If `SERVER_CONFIG` is not set or contains invalid JSON, the server defaults to Local mode.

### Example: Explicit Local Mode
```bash
export SERVER_CONFIG='{"serverMode": "Local"}'
./run_server.sh
```
In Local mode:

- Inference requests are handled by the embedded inference engine
- Embeddings requests are handled by the embedded embeddings engine
- No external services are required
- Supports all existing functionality without changes
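A quick smoke test confirms the embedded services are responding; this sketch assumes the server listens on `localhost:8080`, the port used throughout this guide:

```bash
# Both endpoints are served by the single local process in Local mode.
curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/models
```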
## HighAvailability Mode

In HighAvailability mode, the server acts as a proxy, forwarding requests to external services.

### Example: Basic HighAvailability Mode
```bash
export SERVER_CONFIG='{"serverMode": "HighAvailability"}'
./run_server.sh
```

This uses the default service URLs:

- Inference service: `http://inference-service:8080`
- Embeddings service: `http://embeddings-service:8080`
### Example: Custom Service URLs
```bash
export SERVER_CONFIG='{
  "serverMode": "HighAvailability",
  "services": {
    "inference_url": "http://custom-inference:9000",
    "embeddings_url": "http://custom-embeddings:9001"
  }
}'
./run_server.sh
```
## Docker Compose Example

```yaml
version: '3.8'
services:
  # Inference service
  inference-service:
    image: ghcr.io/geoffsee/inference-service:latest
    ports:
      - "8081:8080"
    environment:
      - RUST_LOG=info

  # Embeddings service
  embeddings-service:
    image: ghcr.io/geoffsee/embeddings-service:latest
    ports:
      - "8082:8080"
    environment:
      - RUST_LOG=info

  # Main proxy server
  predict-otron-9000:
    image: ghcr.io/geoffsee/predict-otron-9000:latest
    ports:
      - "8080:8080"
    environment:
      - RUST_LOG=info
      - SERVER_CONFIG={"serverMode":"HighAvailability","services":{"inference_url":"http://inference-service:8080","embeddings_url":"http://embeddings-service:8080"}}
    depends_on:
      - inference-service
      - embeddings-service
```
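To bring the stack up and verify the proxy end to end, a sketch (assuming Docker Compose v2 and the file above saved as `docker-compose.yml`):

```bash
# Start all three services in the background.
docker compose up -d
# The proxy should answer on the host port mapped above.
curl -s http://localhost:8080/health
# Confirm from the logs that the proxy started in HighAvailability mode.
docker compose logs predict-otron-9000 | grep HighAvailability
```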
## Kubernetes Example

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: server-config
data:
  SERVER_CONFIG: |
    {
      "serverMode": "HighAvailability",
      "services": {
        "inference_url": "http://inference-service:8080",
        "embeddings_url": "http://embeddings-service:8080"
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: predict-otron-9000
spec:
  replicas: 3
  selector:
    matchLabels:
      app: predict-otron-9000
  template:
    metadata:
      labels:
        app: predict-otron-9000
    spec:
      containers:
        - name: predict-otron-9000
          image: ghcr.io/geoffsee/predict-otron-9000:latest
          ports:
            - containerPort: 8080
          env:
            - name: RUST_LOG
              value: "info"
            - name: SERVER_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: server-config
                  key: SERVER_CONFIG
```
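To apply and verify, a sketch assuming the manifests above are saved as `server-config.yaml` and `kubectl` points at the target cluster:

```bash
# Create the ConfigMap and Deployment, then wait for the rollout.
kubectl apply -f server-config.yaml
kubectl rollout status deployment/predict-otron-9000
# Check a pod's startup logs for the HighAvailability banner.
kubectl logs deployment/predict-otron-9000 | grep HighAvailability
```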
## API Compatibility

Both modes expose the same OpenAI-compatible API endpoints:

- `POST /v1/chat/completions` - Chat completions (streaming and non-streaming)
- `GET /v1/models` - List available models
- `POST /v1/embeddings` - Generate text embeddings
- `GET /health` - Health check
- `GET /` - Root endpoint
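A request looks identical in either mode; only where the work happens differs. A sketch of a chat completion call, assuming the server is on `localhost:8080` (the model name is a placeholder, substitute an entry from `GET /v1/models`):

```bash
# Non-streaming chat completion; add "stream": true for streaming.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model-id>",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```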
## Logging

The server logs the selected mode on startup:

**Local Mode:**
```
INFO predict_otron_9000: Running in Local mode - using embedded services
```

**HighAvailability Mode:**
```
INFO predict_otron_9000: Running in HighAvailability mode - proxying to external services
INFO predict_otron_9000: Inference service URL: http://inference-service:8080
INFO predict_otron_9000: Embeddings service URL: http://embeddings-service:8080
```
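Log verbosity follows the standard `RUST_LOG` convention already used in the examples above; raising it can help when diagnosing proxy behavior. A sketch:

```bash
# The examples above use RUST_LOG=info; debug is more verbose.
export RUST_LOG=debug
./run_server.sh
```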
## Error Handling

- Invalid JSON in `SERVER_CONFIG` falls back to Local mode with a warning
- Missing `SERVER_CONFIG` defaults to Local mode
- Network errors to external services return HTTP 502 (Bad Gateway)
- Request/response proxying preserves original HTTP status codes and headers
## Performance Considerations

**Local Mode:**
- Lower latency (no network overhead)
- Higher memory usage (models loaded locally)
- Single point of failure

**HighAvailability Mode:**
- Higher latency (network requests)
- Lower memory usage (no local models)
- Horizontal scaling possible
- Dependent on network reliability
- 5-minute timeout for long-running inference requests
## Troubleshooting

1. **Configuration not applied**: Check the JSON syntax (for example with the `jq` check shown earlier) and restart the server
2. **External services unreachable**: Verify the service URLs and network connectivity, as sketched below
3. **Timeouts**: Check whether inference requests exceed the 5-minute timeout limit
4. **502 errors**: External services may be down or misconfigured
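For unreachable services, talking to each backend directly isolates the failure from the proxy. A sketch against the Docker Compose setup above, where host ports 8081 and 8082 map to the two backends (assuming they expose the same `/health` endpoint as the main server):

```bash
# Bypass the proxy and probe each backend via its published host port.
curl -s http://localhost:8081/health   # inference-service
curl -s http://localhost:8082/health   # embeddings-service
```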
## Migration

To migrate from Local to HighAvailability mode:

1. Deploy separate inference and embeddings services
2. Update `SERVER_CONFIG` to point to the new services
3. Restart the predict-otron-9000 server
4. Verify the endpoints are working with test requests, as sketched below
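A minimal verification pass for step 4, assuming the proxy runs on `localhost:8080` (the model name is again a placeholder):

```bash
# Hit each endpoint the proxy exposes; responses should match Local mode.
curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/models
curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-id>", "input": "migration smoke test"}'
```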
The API contract remains identical, making zero-downtime migration possible.