# Server Configuration Guide
The predict-otron-9000 server supports two deployment modes controlled by the `SERVER_CONFIG` environment variable:
1. **Local Mode** (default): Runs inference and embeddings services locally within the main server process
2. **HighAvailability Mode**: Proxies requests to external inference and embeddings services
## Configuration Format
The `SERVER_CONFIG` environment variable accepts a JSON configuration with the following structure:
```json
{
  "serverMode": "Local",
  "services": {
    "inference_url": "http://inference-service:8080",
    "embeddings_url": "http://embeddings-service:8080"
  }
}
```
or
```json
{
  "serverMode": "HighAvailability",
  "services": {
    "inference_url": "http://inference-service:8080",
    "embeddings_url": "http://embeddings-service:8080"
  }
}
```
**Fields:**
- `serverMode`: Either `"Local"` or `"HighAvailability"`
- `services`: Optional object containing service URLs (uses defaults if not provided)
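Because invalid JSON silently falls back to Local mode (see Error Handling below), it can be worth validating the value before exporting it. A minimal sketch using `jq` (assumes `jq` is installed; it is not part of the project):
```bash
# Validate the config with jq before exporting it, instead of relying on the silent fallback.
CONFIG='{"serverMode":"HighAvailability","services":{"inference_url":"http://inference-service:8080","embeddings_url":"http://embeddings-service:8080"}}'
if echo "$CONFIG" | jq -e . > /dev/null; then
  export SERVER_CONFIG="$CONFIG"
else
  echo "SERVER_CONFIG is not valid JSON; the server would fall back to Local mode" >&2
fi
```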
## Local Mode (Default)
If `SERVER_CONFIG` is not set or contains invalid JSON, the server defaults to Local mode.
### Example: Explicit Local Mode
```bash
export SERVER_CONFIG='{"serverMode": "Local"}'
./run_server.sh
```
In Local mode:
- Inference requests are handled by the embedded inference engine
- Embeddings requests are handled by the embedded embeddings engine
- No external services are required
- Supports all existing functionality without changes
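A quick smoke test of the embedded engines, assuming the server listens on `localhost:8080` as in the deployment examples below:
```bash
# Confirm the locally running server is healthy and reports its models.
curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/models
```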
## HighAvailability Mode
In HighAvailability mode, the server acts as a proxy, forwarding requests to external services.
### Example: Basic HighAvailability Mode
```bash
export SERVER_CONFIG='{"serverMode": "HighAvailability"}'
./run_server.sh
```
This uses the default service URLs:
- Inference service: `http://inference-service:8080`
- Embeddings service: `http://embeddings-service:8080`
### Example: Custom Service URLs
```bash
export SERVER_CONFIG='{
  "serverMode": "HighAvailability",
  "services": {
    "inference_url": "http://custom-inference:9000",
    "embeddings_url": "http://custom-embeddings:9001"
  }
}'
./run_server.sh
```
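Before starting the proxy, it can help to confirm the custom URLs are reachable from the host that will run predict-otron-9000. This sketch assumes the external services expose a `/health` route; adjust the paths to whatever they actually serve:
```bash
# Probe the custom service URLs before pointing the proxy at them.
curl -sf http://custom-inference:9000/health || echo "inference service unreachable" >&2
curl -sf http://custom-embeddings:9001/health || echo "embeddings service unreachable" >&2
```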
## Docker Compose Example
```yaml
version: '3.8'

services:
  # Inference service
  inference-service:
    image: ghcr.io/geoffsee/inference-service:latest
    ports:
      - "8081:8080"
    environment:
      - RUST_LOG=info

  # Embeddings service
  embeddings-service:
    image: ghcr.io/geoffsee/embeddings-service:latest
    ports:
      - "8082:8080"
    environment:
      - RUST_LOG=info

  # Main proxy server
  predict-otron-9000:
    image: ghcr.io/geoffsee/predict-otron-9000:latest
    ports:
      - "8080:8080"
    environment:
      - RUST_LOG=info
      - SERVER_CONFIG={"serverMode":"HighAvailability","services":{"inference_url":"http://inference-service:8080","embeddings_url":"http://embeddings-service:8080"}}
    depends_on:
      - inference-service
      - embeddings-service
```
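Bringing the stack up and confirming the proxy selected HighAvailability mode might look like this:
```bash
# Start all three services, then check the proxy's health endpoint and startup log line.
docker compose up -d
curl -s http://localhost:8080/health
docker compose logs predict-otron-9000 | grep "Running in"
```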
## Kubernetes Example
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: server-config
data:
  SERVER_CONFIG: |
    {
      "serverMode": "HighAvailability",
      "services": {
        "inference_url": "http://inference-service:8080",
        "embeddings_url": "http://embeddings-service:8080"
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: predict-otron-9000
spec:
  replicas: 3
  selector:
    matchLabels:
      app: predict-otron-9000
  template:
    metadata:
      labels:
        app: predict-otron-9000
    spec:
      containers:
        - name: predict-otron-9000
          image: ghcr.io/geoffsee/predict-otron-9000:latest
          ports:
            - containerPort: 8080
          env:
            - name: RUST_LOG
              value: "info"
            - name: SERVER_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: server-config
                  key: SERVER_CONFIG
```
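Applying the manifests and smoke-testing the Deployment might look like this (assumes the YAML above is saved as `predict-otron-9000.yaml`; a Service or Ingress in front of the pods is not shown):
```bash
# Apply the ConfigMap and Deployment, wait for the rollout, and test one pod via port-forward.
kubectl apply -f predict-otron-9000.yaml
kubectl rollout status deployment/predict-otron-9000
kubectl port-forward deployment/predict-otron-9000 8080:8080 &
curl -s http://localhost:8080/health
```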
## API Compatibility
Both modes expose the same OpenAI-compatible API endpoints:
- `POST /v1/chat/completions` - Chat completions (streaming and non-streaming)
- `GET /v1/models` - List available models
- `POST /v1/embeddings` - Generate text embeddings
- `GET /health` - Health check
- `GET /` - Root endpoint
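Example requests that work in either mode (the model names are placeholders; use values returned by `GET /v1/models`):
```bash
# Chat completion (non-streaming) and embeddings through the OpenAI-compatible API.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-chat-model", "messages": [{"role": "user", "content": "Hello"}]}'
curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "your-embedding-model", "input": "Hello"}'
```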
## Logging
The server logs the selected mode on startup:
**Local Mode:**
```
INFO predict_otron_9000: Running in Standalone mode
```
**HighAvailability Mode:**
```
INFO predict_otron_9000: Running in HighAvailability mode - proxying to external services
INFO predict_otron_9000: Inference service URL: http://inference-service:8080
INFO predict_otron_9000: Embeddings service URL: http://embeddings-service:8080
```
## Error Handling
- Invalid JSON in `SERVER_CONFIG` falls back to Local mode with a warning
- Missing `SERVER_CONFIG` defaults to Local mode
- Network errors to external services return HTTP 502 (Bad Gateway)
- Request/response proxying preserves original HTTP status codes and headers
## Performance Considerations
**Local Mode:**
- Lower latency (no network overhead)
- Higher memory usage (models loaded locally)
- Single point of failure
**HighAvailability Mode:**
- Higher latency (network requests)
- Lower memory usage (no local models)
- Horizontal scaling possible
- Network reliability dependent
- 5-minute timeout for long-running inference requests
## Troubleshooting
1. **Configuration not applied**: Check JSON syntax and restart the server
2. **External services unreachable**: Verify service URLs and network connectivity
3. **Timeouts**: Check if inference requests exceed the 5-minute timeout limit
4. **502 errors**: External services may be down or misconfigured
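For items 2 and 4, probing the external services directly usually narrows the problem down quickly. A sketch against the Docker Compose example above (ports 8081/8082 are the published ports from that file; whether the services expose `/health` is an assumption):
```bash
# Check that the external services respond on their published ports and that the config parses.
curl -sf http://localhost:8081/health || echo "inference service not responding" >&2
curl -sf http://localhost:8082/health || echo "embeddings service not responding" >&2
echo "$SERVER_CONFIG" | jq .
```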
## Migration
To migrate from Local to HighAvailability mode:
1. Deploy separate inference and embeddings services
2. Update `SERVER_CONFIG` to point to the new services
3. Restart the predict-otron-9000 server
4. Verify the endpoints with test requests, as in the example below
The API contract remains identical in both modes, so the migration can be performed without downtime.
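A minimal check for step 4, assuming the server is reachable on `localhost:8080`:
```bash
# After restarting in HighAvailability mode, run one request end-to-end through the proxy.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-chat-model", "messages": [{"role": "user", "content": "ping"}]}'
```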