mirror of https://github.com/geoffsee/predict-otron-9001.git, synced 2025-09-08 22:46:44 +00:00
- Introduced ServerConfig for handling deployment modes and services.
- Added HighAvailability mode for proxying requests to external services.
- Maintained Local mode for embedded services.
- Updated `README.md` and included `SERVER_CONFIG.md` for detailed documentation.
docs/SERVER_CONFIG.md (new file, 223 lines)
@@ -0,0 +1,223 @@
# Server Configuration Guide

The predict-otron-9000 server supports two deployment modes controlled by the `SERVER_CONFIG` environment variable:

1. **Local Mode** (default): Runs inference and embeddings services locally within the main server process
2. **HighAvailability Mode**: Proxies requests to external inference and embeddings services

## Configuration Format

The `SERVER_CONFIG` environment variable accepts a JSON configuration with the following structure:
```json
{
  "serverMode": "Local",
  "services": {
    "inference_url": "http://inference-service:8080",
    "embeddings_url": "http://embeddings-service:8080"
  }
}
```

or

```json
{
  "serverMode": "HighAvailability",
  "services": {
    "inference_url": "http://inference-service:8080",
    "embeddings_url": "http://embeddings-service:8080"
  }
}
```
**Fields:**

- `serverMode`: Either `"Local"` or `"HighAvailability"`
- `services`: Optional object containing service URLs (uses defaults if not provided)
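Because invalid JSON makes the server fall back to Local mode with only a warning (see Error Handling below), it can pay to validate the value before starting the server. A minimal sketch using `jq`, assuming it is installed:

```bash
# jq -e exits non-zero if SERVER_CONFIG is not valid JSON,
# which would otherwise trigger a quiet fallback to Local mode.
export SERVER_CONFIG='{"serverMode": "HighAvailability"}'
if echo "$SERVER_CONFIG" | jq -e . > /dev/null; then
  ./run_server.sh
else
  echo "SERVER_CONFIG is not valid JSON" >&2
fi
```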
## Local Mode (Default)

If `SERVER_CONFIG` is not set or contains invalid JSON, the server defaults to Local mode.

### Example: Explicit Local Mode
```bash
export SERVER_CONFIG='{"serverMode": "Local"}'
./run_server.sh
```
In Local mode:

- Inference requests are handled by the embedded inference engine
- Embeddings requests are handled by the embedded embeddings engine
- No external services are required
- Supports all existing functionality without changes
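A quick smoke test confirms the embedded services are responding; this sketch assumes the server listens on `localhost:8080`, the port used throughout this guide:

```bash
# Both endpoints are served by the single local process in Local mode.
curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/models
```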
## HighAvailability Mode

In HighAvailability mode, the server acts as a proxy, forwarding requests to external services.

### Example: Basic HighAvailability Mode
```bash
export SERVER_CONFIG='{"serverMode": "HighAvailability"}'
./run_server.sh
```

This uses the default service URLs:

- Inference service: `http://inference-service:8080`
- Embeddings service: `http://embeddings-service:8080`
### Example: Custom Service URLs
```bash
export SERVER_CONFIG='{
  "serverMode": "HighAvailability",
  "services": {
    "inference_url": "http://custom-inference:9000",
    "embeddings_url": "http://custom-embeddings:9001"
  }
}'
./run_server.sh
```
## Docker Compose Example

```yaml
version: '3.8'
services:
  # Inference service
  inference-service:
    image: ghcr.io/geoffsee/inference-service:latest
    ports:
      - "8081:8080"
    environment:
      - RUST_LOG=info

  # Embeddings service
  embeddings-service:
    image: ghcr.io/geoffsee/embeddings-service:latest
    ports:
      - "8082:8080"
    environment:
      - RUST_LOG=info

  # Main proxy server
  predict-otron-9000:
    image: ghcr.io/geoffsee/predict-otron-9000:latest
    ports:
      - "8080:8080"
    environment:
      - RUST_LOG=info
      - SERVER_CONFIG={"serverMode":"HighAvailability","services":{"inference_url":"http://inference-service:8080","embeddings_url":"http://embeddings-service:8080"}}
    depends_on:
      - inference-service
      - embeddings-service
```
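To bring the stack up and verify the proxy end to end, a sketch (assuming Docker Compose v2 and the file above saved as `docker-compose.yml`):

```bash
# Start all three services in the background.
docker compose up -d
# The proxy should answer on the host port mapped above.
curl -s http://localhost:8080/health
# Confirm from the logs that the proxy started in HighAvailability mode.
docker compose logs predict-otron-9000 | grep HighAvailability
```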
## Kubernetes Example

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: server-config
data:
  SERVER_CONFIG: |
    {
      "serverMode": "HighAvailability",
      "services": {
        "inference_url": "http://inference-service:8080",
        "embeddings_url": "http://embeddings-service:8080"
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: predict-otron-9000
spec:
  replicas: 3
  selector:
    matchLabels:
      app: predict-otron-9000
  template:
    metadata:
      labels:
        app: predict-otron-9000
    spec:
      containers:
        - name: predict-otron-9000
          image: ghcr.io/geoffsee/predict-otron-9000:latest
          ports:
            - containerPort: 8080
          env:
            - name: RUST_LOG
              value: "info"
            - name: SERVER_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: server-config
                  key: SERVER_CONFIG
```
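To apply and verify, a sketch assuming the manifests above are saved as `server-config.yaml` and `kubectl` points at the target cluster:

```bash
# Create the ConfigMap and Deployment, then wait for the rollout.
kubectl apply -f server-config.yaml
kubectl rollout status deployment/predict-otron-9000
# Check a pod's startup logs for the HighAvailability banner.
kubectl logs deployment/predict-otron-9000 | grep HighAvailability
```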
## API Compatibility

Both modes expose the same OpenAI-compatible API endpoints:

- `POST /v1/chat/completions` - Chat completions (streaming and non-streaming)
- `GET /v1/models` - List available models
- `POST /v1/embeddings` - Generate text embeddings
- `GET /health` - Health check
- `GET /` - Root endpoint
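A request looks identical in either mode; only where the work happens differs. A sketch of a chat completion call, assuming the server is on `localhost:8080` (the model name is a placeholder, substitute an entry from `GET /v1/models`):

```bash
# Non-streaming chat completion; add "stream": true for streaming.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model-id>",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```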
## Logging

The server logs the selected mode on startup:

**Local Mode:**
```
INFO predict_otron_9000: Running in Local mode - using embedded services
```

**HighAvailability Mode:**
```
INFO predict_otron_9000: Running in HighAvailability mode - proxying to external services
INFO predict_otron_9000: Inference service URL: http://inference-service:8080
INFO predict_otron_9000: Embeddings service URL: http://embeddings-service:8080
```
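Log verbosity follows the standard `RUST_LOG` convention already used in the examples above; raising it can help when diagnosing proxy behavior. A sketch:

```bash
# The examples above use RUST_LOG=info; debug is more verbose.
export RUST_LOG=debug
./run_server.sh
```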
## Error Handling

- Invalid JSON in `SERVER_CONFIG` falls back to Local mode with a warning
- Missing `SERVER_CONFIG` defaults to Local mode
- Network errors to external services return HTTP 502 (Bad Gateway)
- Request/response proxying preserves original HTTP status codes and headers
## Performance Considerations

**Local Mode:**
- Lower latency (no network overhead)
- Higher memory usage (models loaded locally)
- Single point of failure

**HighAvailability Mode:**
- Higher latency (network requests)
- Lower memory usage (no local models)
- Horizontal scaling possible
- Dependent on network reliability
- 5-minute timeout for long-running inference requests
## Troubleshooting

1. **Configuration not applied**: Check the JSON syntax (for example with the `jq` check shown earlier) and restart the server
2. **External services unreachable**: Verify the service URLs and network connectivity, as sketched below
3. **Timeouts**: Check whether inference requests exceed the 5-minute timeout limit
4. **502 errors**: External services may be down or misconfigured
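For unreachable services, talking to each backend directly isolates the failure from the proxy. A sketch against the Docker Compose setup above, where host ports 8081 and 8082 map to the two backends (assuming they expose the same `/health` endpoint as the main server):

```bash
# Bypass the proxy and probe each backend via its published host port.
curl -s http://localhost:8081/health   # inference-service
curl -s http://localhost:8082/health   # embeddings-service
```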
## Migration

To migrate from Local to HighAvailability mode:

1. Deploy separate inference and embeddings services
2. Update `SERVER_CONFIG` to point to the new services
3. Restart the predict-otron-9000 server
4. Verify the endpoints are working with test requests, as sketched below
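A minimal verification pass for step 4, assuming the proxy runs on `localhost:8080` (the model name is again a placeholder):

```bash
# Hit each endpoint the proxy exposes; responses should match Local mode.
curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/models
curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-id>", "input": "migration smoke test"}'
```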
The API contract remains identical, making zero-downtime migration possible.