Refactor `apply_cached_repeat_penalty` for optimized caching and reuse, add extensive unit tests, and integrate special handling for Gemma-specific models.
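A minimal sketch of the caching idea, assuming the cache memoizes the deduplicated token history so each decode step only touches unique token ids; the signature and cache shape here are illustrative assumptions, not the crate's actual API:

```rust
use std::collections::HashSet;

// Hypothetical signature; shown only to illustrate the caching idea.
// `seen` persists across decode steps, so repeated history tokens are
// folded in once instead of rescanning the whole sequence each step.
fn apply_cached_repeat_penalty(
    logits: &mut [f32],
    new_tokens: &[u32], // tokens generated since the previous call
    penalty: f32,
    seen: &mut HashSet<u32>,
) {
    seen.extend(new_tokens.iter().copied());
    for &tok in seen.iter() {
        // Assumes every token id is a valid index into `logits`.
        let l = &mut logits[tok as usize];
        // Standard repetition penalty: damp logits of already-seen tokens.
        *l = if *l > 0.0 { *l / penalty } else { *l * penalty };
    }
}
```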

Removed `test_request.sh`, deprecated functionality, and unused imports; introduced a new CLI tool (`cli.ts`) for testing the inference engine, and adjusted handling of non-streaming and streaming chat completions.

- Add CPU fallback support for text generation when primary device is unsupported
- Introduce `execute_with_fallback` method to handle device-compatibility and shape-mismatch errors (see the sketch after this list)
- Extend unit tests to reproduce tensor shape mismatch errors specific to model configurations
- Increase HTTP timeout limits in `curl_chat_stream.sh` script for reliable API testing
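A hedged sketch of that fallback flow; the device enum, error type, and string matching below are illustrative assumptions, not the crate's actual API:

```rust
#[derive(Clone, Copy)]
enum Device {
    Accelerator, // e.g. Metal or CUDA
    Cpu,
}

// Run `op` on the primary device; retry on CPU only for the error
// classes the commit targets (unsupported device, shape mismatch).
fn execute_with_fallback<T>(
    op: impl Fn(Device) -> Result<T, String>,
) -> Result<T, String> {
    match op(Device::Accelerator) {
        Err(e) if e.contains("unsupported") || e.contains("shape mismatch") => op(Device::Cpu),
        other => other,
    }
}
```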

Chat completion endpoint now works with gemma3 (non-streaming).

Add benchmarking guide with HTML reporting, Leptos chat crate, and middleware for metrics tracking
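A hedged sketch of what such metrics middleware could look like, assuming the unified server is built on axum 0.7 with `tracing`; the middleware in this commit may record different fields:

```rust
use axum::{extract::Request, middleware::Next, response::Response};
use std::time::Instant;

// Records per-request latency; attach with
// `Router::layer(axum::middleware::from_fn(track_metrics))`.
async fn track_metrics(req: Request, next: Next) -> Response {
    let start = Instant::now();
    let path = req.uri().path().to_owned();
    let res = next.run(req).await;
    tracing::info!(%path, elapsed_ms = start.elapsed().as_millis() as u64, "request handled");
    res
}
```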
geoffsee
2025-08-26 01:30:26 -04:00
parent 7dd23213c9
commit 8338750beb
64 changed files with 14997 additions and 220 deletions


@@ -15,6 +15,9 @@ Aliens, in a native executable.
- **OpenAI Compatible**: API endpoints match OpenAI's format for easy integration
- **Text Embeddings**: Generate high-quality text embeddings using the Nomic Embed Text v1.5 model
- **Text Generation**: Chat completions with OpenAI-compatible API (simplified implementation)
- **Performance Optimized**: Implements efficient caching and singleton patterns for improved throughput and reduced latency (see the sketch after this list)
- **Performance Benchmarking**: Includes tools for measuring performance and generating HTML reports
- **Web Chat Interface**: A Leptos-based WebAssembly chat interface for interacting with the inference engine
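As an illustration of the singleton pattern mentioned above, a minimal sketch using `std::sync::OnceLock` so model weights load once and are shared across requests; the actual types in these crates differ:

```rust
use std::sync::OnceLock;

struct Model; // placeholder for the real model handle

impl Model {
    fn load() -> Model {
        // Expensive weight loading happens exactly once.
        Model
    }
}

static MODEL: OnceLock<Model> = OnceLock::new();

fn model() -> &'static Model {
    // First caller initializes; subsequent callers reuse the instance.
    MODEL.get_or_init(Model::load)
}
```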
## Architecture
@@ -23,6 +26,7 @@ Aliens, in a native executable.
- **`predict-otron-9000`**: Main unified server that combines both engines
- **`embeddings-engine`**: Handles text embeddings using FastEmbed and Nomic models
- **`inference-engine`**: Provides text generation capabilities (with modular design for various models)
- **`leptos-chat`**: WebAssembly-based chat interface built with Leptos framework for interacting with the inference engine
## Installation
@@ -202,6 +206,10 @@ cargo test -p embeddings-engine
cargo test -p inference-engine
```
For comprehensive testing documentation, including unit tests, integration tests, end-to-end tests, and performance testing, please refer to the [TESTING.md](docs/TESTING.md) document.
For performance benchmarking with HTML report generation, see the [BENCHMARKING.md](BENCHMARKING.md) guide.
### Adding Features
1. **Embeddings Engine**: Modify `crates/embeddings-engine/src/lib.rs` to add new embedding models or functionality
@@ -223,11 +231,42 @@ export RUST_LOG=trace
export RUST_LOG=predict_otron_9000=debug,embeddings_engine=trace
```
## Chat Interface
The project includes a WebAssembly-based chat interface built with the Leptos framework.
### Building the Chat Interface
```shell
# Navigate to the leptos-chat crate
cd crates/leptos-chat
# Build the WebAssembly package
cargo build --target wasm32-unknown-unknown
# For development with trunk (if installed)
trunk serve
```
### Usage
The chat interface connects to the inference engine API and provides a user-friendly way to interact with the AI models (a request sketch follows the lists below). To use:
1. Start the predict-otron-9000 server
2. Open the chat interface in a web browser
3. Enter messages and receive AI-generated responses
The interface supports:
- Real-time messaging with the AI
- Visual indication of when the AI is generating a response
- Message history display
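As a rough sketch of the request the interface makes, assuming a `gloo-net`-based fetch from the WASM side; the actual Leptos component code may differ. The endpoint and payload mirror the cURL examples below:

```rust
use gloo_net::http::Request;
use serde_json::json;

// Posts one user message to the inference engine and returns the raw
// JSON body. Must run on the wasm32-unknown-unknown target.
async fn send_chat(message: &str) -> Result<String, gloo_net::Error> {
    let resp = Request::post("http://localhost:8080/v1/chat/completions")
        .json(&json!({
            "model": "gemma-3-1b-it",
            "messages": [{ "role": "user", "content": message }],
            "max_tokens": 128,
            "stream": false
        }))?
        .send()
        .await?;
    resp.text().await
}
```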
## Limitations
- **Inference Engine**: Currently provides a simplified implementation for chat completions. Full model loading and text generation capabilities from the inference-engine crate are not yet integrated into the unified server.
- **Model Support**: Embeddings are limited to the Nomic Embed Text v1.5 model.
- **Scalability**: Single-threaded model loading may impact performance under heavy load.
- **Chat Interface**: The WebAssembly chat interface requires compilation to a static site before deployment.
## Contributing
@@ -235,4 +274,47 @@ export RUST_LOG=predict_otron_9000=debug,embeddings_engine=trace
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and add tests
4. Ensure all tests pass: `cargo test`
5. Submit a pull request
## Quick cURL verification for Chat Endpoints
Start the unified server:
```shell
./run_server.sh
```
Non-streaming chat completion (expects JSON response):
```shell
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-3-1b-it",
"messages": [
{"role": "user", "content": "Who was the 16th president of the United States?"}
],
"max_tokens": 128,
"stream": false
}'
```
Streaming chat completion via Server-Sent Events (SSE):
```shell
curl -N -X POST http://localhost:8080/v1/chat/completions/stream \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-3-1b-it",
"messages": [
{"role": "user", "content": "Who was the 16th president of the United States?"}
],
"max_tokens": 128,
"stream": true
}'
```
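For programmatic consumption of the stream, a minimal sketch using `reqwest` with its `stream` feature (plus `tokio`, `futures-util`, and `serde_json`); error handling is kept deliberately thin:

```rust
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions/stream")
        .json(&serde_json::json!({
            "model": "gemma-3-1b-it",
            "messages": [{"role": "user", "content": "Who was the 16th president of the United States?"}],
            "max_tokens": 128,
            "stream": true
        }))
        .send()
        .await?;

    // SSE frames arrive as `data: {json}` lines; print them as they land.
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```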
Helper scripts are also available:
- `scripts/curl_chat.sh`
- `scripts/curl_chat_stream.sh`