# @open-web-agent-rs/legacy-inference-engine
## Note

This is here as a reference implementation. This is harder than it looks.

A Rust-based inference engine for running large language models locally. This tool supports both CLI mode for direct text generation and server mode with an OpenAI-compatible API.

## Features

- Run Gemma models locally (1B, 2B, 7B, 9B variants)
- CLI mode for direct text generation
- Server mode with OpenAI-compatible API
- Support for various model configurations (base, instruction-tuned)
- Metal acceleration on macOS
## Installation
### Prerequisites

- Rust toolchain (install via [rustup](https://rustup.rs/))
- Cargo package manager
- For GPU acceleration:
  - macOS: Metal support
  - Linux/Windows: CUDA support (requires appropriate drivers)
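If the toolchain is not already installed, it can be set up and verified with the standard rustup bootstrap (generic Rust setup, not specific to this project):

```bash
# Install the Rust toolchain via rustup, then confirm the tools are on the PATH
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustc --version
cargo --version
```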
### Building from Source

1. Clone the repository:

   ```bash
   git clone https://github.com/seemueller-io/open-web-agent-rs.git
   cd open-web-agent-rs
   ```

2. Build the legacy inference engine:

   ```bash
   cargo build -p legacy-inference-engine --release
   ```
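After a release build, the compiled binary can also be run directly; the binary name below is assumed from the package name and may differ:

```bash
# Hypothetical binary path inferred from the package name `legacy-inference-engine`
./target/release/legacy-inference-engine --prompt 'Hello there.' --which 3-1b-it
```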
## Usage
### CLI Mode

Run the inference engine in CLI mode to generate text directly:

```bash
cargo run -p legacy-inference-engine --release -- --prompt 'Name the 16th President of the USA.' --which 3-1b-it
```
#### CLI Options

- `--prompt <TEXT>`: The prompt text to generate from
- `--which <MODEL>`: Model variant to use (default: "3-1b-it")
  - Available options: "2b", "7b", "2b-it", "7b-it", "1.1-2b-it", "1.1-7b-it", "code-2b", "code-7b", "code-2b-it", "code-7b-it", "2-2b", "2-2b-it", "2-9b", "2-9b-it", "3-1b", "3-1b-it"
- `--server`: Run an OpenAI-compatible server
- `--temperature <FLOAT>`: Temperature for sampling (higher = more random)
- `--top-p <FLOAT>`: Nucleus sampling probability cutoff
- `--sample-len <INT>`: Maximum number of tokens to generate (default: 10000)
- `--repeat-penalty <FLOAT>`: Penalty for repeating tokens (default: 1.1)
- `--repeat-last-n <INT>`: Context size for repeat penalty (default: 64)
- `--cpu`: Run on CPU instead of GPU
- `--tracing`: Enable tracing (generates a trace-timestamp.json file)
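As a sketch, several of these flags can be combined in one invocation (the values shown are arbitrary examples, not recommended defaults):

```bash
# Generate up to 256 tokens on CPU with explicit sampling settings
cargo run -p legacy-inference-engine --release -- \
  --prompt 'Write a haiku about Rust.' \
  --which 3-1b-it \
  --temperature 0.8 \
  --top-p 0.95 \
  --sample-len 256 \
  --repeat-penalty 1.1 \
  --repeat-last-n 64 \
  --cpu
```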
### Server Mode with OpenAI-compatible API
Run the inference engine in server mode to expose an OpenAI-compatible API:

```bash
cargo run -p legacy-inference-engine --release -- --server --port 3777 --which 3-1b-it
```
This starts a web server on the specified port (default: 3777) with an OpenAI-compatible chat completions endpoint.
#### Server Options

- `--server`: Run in server mode
- `--port <INT>`: Port to use for the server (default: 3777)
- `--which <MODEL>`: Model variant to use (default: "3-1b-it")
- Other model options as described in CLI mode
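For example, to serve a different Gemma variant on another port:

```bash
# Serve the Gemma 2 2B instruction-tuned variant on port 8080
cargo run -p legacy-inference-engine --release -- --server --port 8080 --which 2-2b-it
```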
## API Usage
The server exposes an OpenAI-compatible chat completions endpoint:

### Chat Completions

```
POST /v1/chat/completions
```
#### Request Format

```json
{
  "model": "gemma-3-1b-it",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "top_p": 0.9,
  "stream": false
}
```
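Setting `"stream": true` requests a streaming response. Assuming the server follows the usual OpenAI convention of server-sent `data:` chunks (not verified here), a streaming request can be exercised from the command line like this:

```bash
# -N disables curl's output buffering so chunks print as they arrive
curl -N -X POST http://localhost:3777/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true
  }'
```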
#### Response Format

```json
{
  "id": "chatcmpl-123abc456def789ghi",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "gemma-3-1b-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 15,
    "total_tokens": 40
  }
}
```
### Example: Using cURL

```bash
curl -X POST http://localhost:3777/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
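Because the response is plain JSON, the assistant's reply can be extracted directly, for example with `jq` (assuming it is installed):

```bash
# Print only the generated message content
curl -s -X POST http://localhost:3777/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-3-1b-it", "messages": [{"role": "user", "content": "What is the capital of France?"}]}' \
  | jq -r '.choices[0].message.content'
```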
### Example: Using Python with OpenAI Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3777/v1",
    api_key="dummy"  # API key is not validated but required by the client
)

response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=100
)

print(response.choices[0].message.content)
```
### Example: Using JavaScript/TypeScript with OpenAI SDK

```javascript
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:3777/v1',
  apiKey: 'dummy', // API key is not validated but required by the client
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'gemma-3-1b-it',
    messages: [
      { role: 'user', content: 'What is the capital of France?' }
    ],
    temperature: 0.7,
    max_tokens: 100,
  });

  console.log(response.choices[0].message.content);
}

main();
```
## Troubleshooting

### Common Issues

1. **Model download errors**: Make sure you have a stable internet connection. The models are downloaded from the Hugging Face Hub; if downloads are rejected rather than simply failing, see the authentication example after this list.
2. **Out of memory errors**: Try using a smaller model variant or reducing the batch size.
3. **Slow inference on CPU**: This is expected. For better performance, use GPU acceleration if available.
4. **Metal/CUDA errors**: Ensure you have the latest drivers installed for your GPU.
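The Gemma models on the Hugging Face Hub are gated and require accepting the model license. Assuming the engine honors the standard Hugging Face token environment variable (an assumption, not verified against this codebase), exporting a token before running may resolve authorization errors:

```bash
# Hypothetical: create a token at https://huggingface.co/settings/tokens after
# accepting the Gemma license, then export it before running the engine.
export HF_TOKEN=hf_your_token_here
cargo run -p legacy-inference-engine --release -- --prompt 'Hello' --which 3-1b-it
```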
## License
This project is licensed under the terms specified in the LICENSE file.