Refactor `apply_cached_repeat_penalty` for optimized caching and reuse, add extensive unit tests, and integrate special handling for Gemma-specific models.
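A minimal sketch of the caching idea, assuming the cache memoizes the deduplicated token history so each decode step only touches unique token ids; the signature and cache shape here are illustrative assumptions, not the crate's actual API:

```rust
use std::collections::HashSet;

// Hypothetical signature; shown only to illustrate the caching idea.
// `seen` persists across decode steps, so repeated history tokens are
// folded in once instead of rescanning the whole sequence each step.
fn apply_cached_repeat_penalty(
    logits: &mut [f32],
    new_tokens: &[u32], // tokens generated since the previous call
    penalty: f32,
    seen: &mut HashSet<u32>,
) {
    seen.extend(new_tokens.iter().copied());
    for &tok in seen.iter() {
        // Assumes every token id is a valid index into `logits`.
        let l = &mut logits[tok as usize];
        // Standard repetition penalty: damp logits of already-seen tokens.
        *l = if *l > 0.0 { *l / penalty } else { *l * penalty };
    }
}
```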

Removed `test_request.sh`, deprecated functionality, and unused imports; introduced a new CLI tool (`cli.ts`) for testing the inference engine, and adjusted handling of non-streaming and streaming chat completions.

- Add CPU fallback support for text generation when primary device is unsupported
- Introduce `execute_with_fallback` method to handle device-compatibility and shape-mismatch errors (see the sketch after this list)
- Extend unit tests to reproduce tensor shape mismatch errors specific to model configurations
- Increase HTTP timeout limits in `curl_chat_stream.sh` script for reliable API testing
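A hedged sketch of that fallback flow; the device enum, error type, and string matching below are illustrative assumptions, not the crate's actual API:

```rust
#[derive(Clone, Copy)]
enum Device {
    Accelerator, // e.g. Metal or CUDA
    Cpu,
}

// Run `op` on the primary device; retry on CPU only for the error
// classes the commit targets (unsupported device, shape mismatch).
fn execute_with_fallback<T>(
    op: impl Fn(Device) -> Result<T, String>,
) -> Result<T, String> {
    match op(Device::Accelerator) {
        Err(e) if e.contains("unsupported") || e.contains("shape mismatch") => op(Device::Cpu),
        other => other,
    }
}
```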

Chat completion endpoint now works with gemma3 (non-streaming).

Add benchmarking guide with HTML reporting, Leptos chat crate, and middleware for metrics tracking
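A hedged sketch of what such metrics middleware could look like, assuming the unified server is built on axum 0.7 with `tracing`; the middleware in this commit may record different fields:

```rust
use axum::{extract::Request, middleware::Next, response::Response};
use std::time::Instant;

// Records per-request latency; attach with
// `Router::layer(axum::middleware::from_fn(track_metrics))`.
async fn track_metrics(req: Request, next: Next) -> Response {
    let start = Instant::now();
    let path = req.uri().path().to_owned();
    let res = next.run(req).await;
    tracing::info!(%path, elapsed_ms = start.elapsed().as_millis() as u64, "request handled");
    res
}
```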
geoffsee
2025-08-26 01:30:26 -04:00
parent 7dd23213c9
commit 8338750beb
64 changed files with 14997 additions and 220 deletions


@@ -15,6 +15,9 @@ Aliens, in a native executable.
- **OpenAI Compatible**: API endpoints match OpenAI's format for easy integration
- **Text Embeddings**: Generate high-quality text embeddings using the Nomic Embed Text v1.5 model
- **Text Generation**: Chat completions with OpenAI-compatible API (simplified implementation)
- **Performance Optimized**: Implements efficient caching and singleton patterns for improved throughput and reduced latency (see the sketch after this list)
- **Performance Benchmarking**: Includes tools for measuring performance and generating HTML reports
- **Web Chat Interface**: A Leptos-based WebAssembly chat interface for interacting with the inference engine
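As an illustration of the singleton pattern mentioned above, a minimal sketch using `std::sync::OnceLock` so model weights load once and are shared across requests; the actual types in these crates differ:

```rust
use std::sync::OnceLock;

struct Model; // placeholder for the real model handle

impl Model {
    fn load() -> Model {
        // Expensive weight loading happens exactly once.
        Model
    }
}

static MODEL: OnceLock<Model> = OnceLock::new();

fn model() -> &'static Model {
    // First caller initializes; subsequent callers reuse the instance.
    MODEL.get_or_init(Model::load)
}
```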
## Architecture
@@ -23,6 +26,7 @@ Aliens, in a native executable.
- **`predict-otron-9000`**: Main unified server that combines both engines
- **`embeddings-engine`**: Handles text embeddings using FastEmbed and Nomic models
- **`inference-engine`**: Provides text generation capabilities (with modular design for various models)
- **`leptos-chat`**: WebAssembly-based chat interface built with Leptos framework for interacting with the inference engine
## Installation
@@ -202,6 +206,10 @@ cargo test -p embeddings-engine
cargo test -p inference-engine
```
For comprehensive testing documentation, including unit tests, integration tests, end-to-end tests, and performance testing, please refer to the [TESTING.md](docs/TESTING.md) document.
For performance benchmarking with HTML report generation, see the [BENCHMARKING.md](BENCHMARKING.md) guide.
### Adding Features
1. **Embeddings Engine**: Modify `crates/embeddings-engine/src/lib.rs` to add new embedding models or functionality
@@ -223,11 +231,42 @@ export RUST_LOG=trace
export RUST_LOG=predict_otron_9000=debug,embeddings_engine=trace
```
## Chat Interface
The project includes a WebAssembly-based chat interface built with the Leptos framework.
### Building the Chat Interface
```shell
# Navigate to the leptos-chat crate
cd crates/leptos-chat
# Build the WebAssembly package
cargo build --target wasm32-unknown-unknown
# For development with trunk (if installed)
trunk serve
```
### Usage
The chat interface connects to the inference engine API and provides a user-friendly way to interact with the AI models (a request sketch follows the lists below). To use:
1. Start the predict-otron-9000 server
2. Open the chat interface in a web browser
3. Enter messages and receive AI-generated responses
The interface supports:
- Real-time messaging with the AI
- Visual indication of when the AI is generating a response
- Message history display
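As a rough sketch of the request the interface makes, assuming a `gloo-net`-based fetch from the WASM side; the actual Leptos component code may differ. The endpoint and payload mirror the cURL examples below:

```rust
use gloo_net::http::Request;
use serde_json::json;

// Posts one user message to the inference engine and returns the raw
// JSON body. Must run on the wasm32-unknown-unknown target.
async fn send_chat(message: &str) -> Result<String, gloo_net::Error> {
    let resp = Request::post("http://localhost:8080/v1/chat/completions")
        .json(&json!({
            "model": "gemma-3-1b-it",
            "messages": [{ "role": "user", "content": message }],
            "max_tokens": 128,
            "stream": false
        }))?
        .send()
        .await?;
    resp.text().await
}
```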
## Limitations
- **Inference Engine**: Currently provides a simplified implementation for chat completions. Full model loading and text generation capabilities from the inference-engine crate are not yet integrated into the unified server.
- **Model Support**: Embeddings are limited to the Nomic Embed Text v1.5 model.
- **Scalability**: Single-threaded model loading may impact performance under heavy load.
- **Chat Interface**: The WebAssembly chat interface requires compilation to a static site before deployment.
## Contributing
@@ -235,4 +274,47 @@ export RUST_LOG=predict_otron_9000=debug,embeddings_engine=trace
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and add tests
4. Ensure all tests pass: `cargo test`
5. Submit a pull request
## Quick cURL verification for Chat Endpoints
Start the unified server:
```shell
./run_server.sh
```
Non-streaming chat completion (expects JSON response):
```shell
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-3-1b-it",
"messages": [
{"role": "user", "content": "Who was the 16th president of the United States?"}
],
"max_tokens": 128,
"stream": false
}'
```
Streaming chat completion via Server-Sent Events (SSE):
```shell
curl -N -X POST http://localhost:8080/v1/chat/completions/stream \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-3-1b-it",
"messages": [
{"role": "user", "content": "Who was the 16th president of the United States?"}
],
"max_tokens": 128,
"stream": true
}'
```
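For programmatic consumption of the stream, a minimal sketch using `reqwest` with its `stream` feature (plus `tokio`, `futures-util`, and `serde_json`); error handling is kept deliberately thin:

```rust
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions/stream")
        .json(&serde_json::json!({
            "model": "gemma-3-1b-it",
            "messages": [{"role": "user", "content": "Who was the 16th president of the United States?"}],
            "max_tokens": 128,
            "stream": true
        }))
        .send()
        .await?;

    // SSE frames arrive as `data: {json}` lines; print them as they land.
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```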
Helper scripts are also available:
- `scripts/curl_chat.sh`
- `scripts/curl_chat_stream.sh`