
@open-web-agent-rs/legacy-inference-engine
Note
This is here as a reference implementation. This is harder than it looks.
A Rust-based inference engine for running large language models locally. This tool supports both CLI mode for direct text generation and server mode with an OpenAI-compatible API.
Features
- Run Gemma models locally (1B, 2B, 7B, 9B variants)
- CLI mode for direct text generation
- Server mode with OpenAI-compatible API
- Support for various model configurations (base, instruction-tuned)
- Metal acceleration on macOS
Installation
Prerequisites
- Rust toolchain (install via rustup; see the command below)
- Cargo package manager
- For GPU acceleration:
- macOS: Metal support
- Linux/Windows: CUDA support (requires appropriate drivers)
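If Rust is not installed yet, the official rustup installer sets up both the toolchain and Cargo:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh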
Building from Source
- Clone the repository:
git clone https://github.com/seemueller-io/open-web-agent-rs.git
cd open-web-agent-rs
- Build the local inference engine:
cargo build -p legacy-inference-engine --release
Usage
CLI Mode
Run the inference engine in CLI mode to generate text directly:
cargo run -p legacy-inference-engine --release -- --prompt 'Name the 16th President of the USA.' --which 3-1b-it
CLI Options
- --prompt <TEXT>: The prompt text to generate from
- --which <MODEL>: Model variant to use (default: "3-1b-it")
  - Available options: "2b", "7b", "2b-it", "7b-it", "1.1-2b-it", "1.1-7b-it", "code-2b", "code-7b", "code-2b-it", "code-7b-it", "2-2b", "2-2b-it", "2-9b", "2-9b-it", "3-1b", "3-1b-it"
- --server: Run an OpenAI-compatible server
- --temperature <FLOAT>: Temperature for sampling (higher = more random)
- --top-p <FLOAT>: Nucleus sampling probability cutoff
- --sample-len <INT>: Maximum number of tokens to generate (default: 10000)
- --repeat-penalty <FLOAT>: Penalty for repeating tokens (default: 1.1)
- --repeat-last-n <INT>: Context size for repeat penalty (default: 64)
- --cpu: Run on CPU instead of GPU
- --tracing: Enable tracing (generates a trace-timestamp.json file)
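For example, several of these flags can be combined in a single run (the prompt and sampling values below are only illustrative):
cargo run -p legacy-inference-engine --release -- \
  --prompt 'Write a haiku about Rust.' \
  --which 3-1b-it \
  --temperature 0.8 \
  --top-p 0.95 \
  --sample-len 256 \
  --repeat-penalty 1.1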
Server Mode with OpenAI-compatible API
Run the inference engine in server mode to expose an OpenAI-compatible API:
cargo run -p legacy-inference-engine --release -- --server --port 3777 --which 3-1b-it
This starts a web server on the specified port (default: 3777) with an OpenAI-compatible chat completions endpoint.
Server Options
- --server: Run in server mode
- --port <INT>: Port to use for the server (default: 3777)
- --which <MODEL>: Model variant to use (default: "3-1b-it")
  - Other model options as described in CLI Mode
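For example, to serve a different model variant on another port (values are illustrative):
cargo run -p legacy-inference-engine --release -- --server --port 8080 --which 2-2b-it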
API Usage
The server exposes an OpenAI-compatible chat completions endpoint:
Chat Completions
POST /v1/chat/completions
Request Format
{
  "model": "gemma-3-1b-it",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "top_p": 0.9,
  "stream": false
}
Response Format
{
  "id": "chatcmpl-123abc456def789ghi",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "gemma-3-1b-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 15,
    "total_tokens": 40
  }
}
Example: Using cURL
curl -X POST http://localhost:3777/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
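The request schema also accepts a "stream" field. If your build supports streaming responses, a streaming request can be sketched as follows (this assumes OpenAI-style server-sent events; -N turns off curl's output buffering):
curl -N -X POST http://localhost:3777/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [
      {"role": "user", "content": "Tell me a short story."}
    ],
    "stream": true
  }'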
Example: Using Python with OpenAI Client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3777/v1",
    api_key="dummy"  # API key is not validated but required by the client
)

response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=100
)

print(response.choices[0].message.content)
Example: Using JavaScript/TypeScript with OpenAI SDK
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:3777/v1',
  apiKey: 'dummy', // API key is not validated but required by the client
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'gemma-3-1b-it',
    messages: [
      { role: 'user', content: 'What is the capital of France?' }
    ],
    temperature: 0.7,
    max_tokens: 100,
  });
  console.log(response.choices[0].message.content);
}

main();
Troubleshooting
Common Issues
- Model download errors: Make sure you have a stable internet connection. The models are downloaded from the Hugging Face Hub.
- Out of memory errors: Try using a smaller model variant or reducing the batch size.
- Slow inference on CPU: This is expected. For better performance, use GPU acceleration if available.
- Metal/CUDA errors: Ensure you have the latest drivers installed for your GPU. If acceleration keeps failing, forcing CPU execution can help isolate the problem, as shown below.
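For example, to rule out GPU driver problems, run a generation entirely on the CPU using the documented --cpu flag (the prompt is illustrative; expect slower output):
cargo run -p legacy-inference-engine --release -- --cpu --prompt 'Name the 16th President of the USA.' --which 3-1b-it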
License
This project is licensed under the terms specified in the LICENSE file.