Refactor apply_cached_repeat_penalty for optimized caching and reuse, add extensive unit tests, and integrate special handling for gemma-specific models.

Removed `test_request.sh`, deprecated functionality, and unused imports; introduced a new CLI tool (`cli.ts`) for testing inference engine and adjusted handling of non-streaming/streaming chat completions.

- Add CPU fallback support for text generation when primary device is unsupported
- Introduce `execute_with_fallback` method to handle device compatibility and shape mismatch errors
- Extend unit tests to reproduce tensor shape mismatch errors specific to model configurations
- Increase HTTP timeout limits in `curl_chat_stream.sh` script for reliable API testing

chat completion endpoint functions with gemma3 (no streaming)

Add benchmarking guide with HTML reporting, Leptos chat crate, and middleware for metrics tracking
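
For readers unfamiliar with the CPU fallback pattern mentioned above, here is a minimal sketch of how a method like `execute_with_fallback` could behave. It is not the code from this commit: the `Device` and `GenError` types, the function signature, and the rule for which errors trigger a retry are all assumptions made for illustration. The idea is to run the operation on the primary device and retry once on the CPU only if the failure looks device-related.

```rust
/// Hypothetical device selector (an assumption; the real project
/// presumably uses its tensor library's device type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Device {
    Accelerator,
    Cpu,
}

/// Hypothetical error type covering the failure modes named in the
/// commit message, plus a catch-all that should not trigger a retry.
#[derive(Debug, PartialEq)]
enum GenError {
    UnsupportedDevice(Device),
    ShapeMismatch { expected: usize, got: usize },
    Other(String),
}

impl GenError {
    /// Assumption: only device-compatibility and shape-mismatch errors
    /// are worth retrying on the CPU.
    fn is_recoverable_on_cpu(&self) -> bool {
        matches!(
            self,
            GenError::UnsupportedDevice(_) | GenError::ShapeMismatch { .. }
        )
    }
}

/// Runs `op` on `primary`; on a recoverable error, retries once on the CPU.
/// Returns the result together with the device that produced it.
fn execute_with_fallback<T, F>(primary: Device, mut op: F) -> Result<(T, Device), GenError>
where
    F: FnMut(Device) -> Result<T, GenError>,
{
    match op(primary) {
        Ok(value) => Ok((value, primary)),
        // Do not retry if we were already on the CPU.
        Err(err) if primary != Device::Cpu && err.is_recoverable_on_cpu() => {
            op(Device::Cpu).map(|value| (value, Device::Cpu))
        }
        Err(err) => Err(err),
    }
}
```

Returning the device alongside the value lets the caller log or record when a request was served by the fallback path. Errors that are not device-related are passed through unchanged so they are not hidden by a second attempt.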
2025-09-08 22:46:44 +00:00 · 2025-08-26 01:30:26 -04:00
parent 7dd23213c9
commit 8338750beb
64 changed files with 14997 additions and 220 deletions
--- a/run_server.sh
+++ b/run_server.sh
@@ -1,3 +1,7 @@
-#!/usr/bin/env sh
+#!/bin/bash

-cargo run --bin predict-otron-9000
+# Start the unified predict-otron-9000 server on port 8080
+export SERVER_PORT=${SERVER_PORT:-8080}
+export RUST_LOG=${RUST_LOG:-info}
+
+cargo run --bin predict-otron-9000 --release