Supports small Llama and Gemma models
Refactor inference into dedicated crates for Llama and Gemma (not yet integrated)
crates/gemma-runner/README.md (new file, 137 lines)
@@ -0,0 +1,137 @@
# Gemma Runner

Fast Gemma inference with the Candle framework in Rust.
## Features

- Support for multiple Gemma model versions (v1, v2, v3)
- GPU acceleration with CUDA and Metal
- Configurable sampling parameters
- Multiple model variants including instruct and code models
## Supported Models

### Gemma v1
- `gemma-2b` - Base 2B model
- `gemma-7b` - Base 7B model
- `gemma-2b-it` - Instruct 2B model
- `gemma-7b-it` - Instruct 7B model
- `gemma-1.1-2b-it` - Instruct 2B v1.1 model
- `gemma-1.1-7b-it` - Instruct 7B v1.1 model

### CodeGemma
- `codegemma-2b` - Code base 2B model
- `codegemma-7b` - Code base 7B model
- `codegemma-2b-it` - Code instruct 2B model
- `codegemma-7b-it` - Code instruct 7B model

### Gemma v2
- `gemma-2-2b` - Base 2B v2 model (default)
- `gemma-2-2b-it` - Instruct 2B v2 model
- `gemma-2-9b` - Base 9B v2 model
- `gemma-2-9b-it` - Instruct 9B v2 model

### Gemma v3
- `gemma-3-1b` - Base 1B v3 model
- `gemma-3-1b-it` - Instruct 1B v3 model
## Installation

```bash
cd gemma-runner
cargo build --release
```

For GPU support:

```bash
# CUDA
cargo build --release --features cuda

# Metal (macOS)
cargo build --release --features metal
```
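If you prefer to invoke the compiled binary directly rather than going through `cargo run`, something like the following should work. This is a sketch that assumes the binary name matches the crate name (`gemma-runner`); adjust the path if your workspace puts `target/` elsewhere.

```bash
# Build once, then call the release binary directly (assumed binary name)
cargo build --release
./target/release/gemma-runner --prompt "The capital of France is" --max-tokens 50
```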
## Usage

### Basic Usage

```bash
# Run with the default model (gemma-2-2b)
cargo run -- --prompt "The capital of France is"

# Specify a different model
cargo run -- --model gemma-2b-it --prompt "Explain quantum computing"

# Generate more tokens
cargo run -- --model codegemma-2b-it --prompt "Write a Python function to sort a list" --max-tokens 200
```
### Advanced Options

```bash
# Use CPU instead of GPU
cargo run -- --cpu --prompt "Hello world"

# Adjust sampling parameters
cargo run -- --temperature 0.8 --top-p 0.9 --prompt "Write a story about"

# Use a custom model from the HuggingFace Hub
cargo run -- --model-id "google/gemma-2-2b-it" --prompt "What is AI?"

# Enable tracing for performance analysis
cargo run -- --tracing --prompt "Explain machine learning"
```
### Command Line Arguments

- `--prompt, -p` - The prompt to generate text from (default: "The capital of France is")
- `--model, -m` - The model to use (default: "gemma-2-2b")
- `--cpu` - Run on CPU rather than GPU
- `--temperature, -t` - Sampling temperature (optional)
- `--top-p` - Nucleus sampling probability cutoff (optional)
- `--seed` - Random seed (default: 299792458)
- `--max-tokens, -n` - Maximum tokens to generate (default: 100)
- `--model-id` - Custom model ID from the HuggingFace Hub
- `--revision` - Model revision (default: "main")
- `--use-flash-attn` - Use flash attention
- `--repeat-penalty` - Repetition penalty (default: 1.1)
- `--repeat-last-n` - Context size for the repeat penalty (default: 64)
- `--dtype` - Data type (f16, bf16, f32)
- `--tracing` - Enable performance tracing

Several of these flags can be combined in a single invocation; a hypothetical example is sketched below.
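The flag names in this sketch come from the list above, but the model choice and parameter values are illustrative rather than recommendations:

```bash
# Combine model selection, sampling, repetition-penalty, and precision flags
cargo run --release -- \
  --model gemma-2-2b-it \
  --prompt "Summarize the Rust ownership model" \
  --max-tokens 150 \
  --temperature 0.7 \
  --top-p 0.95 \
  --seed 42 \
  --repeat-penalty 1.15 \
  --repeat-last-n 64 \
  --dtype bf16
```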
## Examples

### Text Generation

```bash
cargo run -- --model gemma-2b-it --prompt "Explain the theory of relativity" --max-tokens 150
```

### Code Generation

```bash
cargo run -- --model codegemma-7b-it --prompt "Write a Rust function to calculate factorial" --max-tokens 100
```

### Creative Writing

```bash
cargo run -- --model gemma-7b-it --temperature 0.9 --prompt "Once upon a time in a magical forest" --max-tokens 200
```

### Chat with Gemma 3 (Instruct format)

```bash
cargo run -- --model gemma-3-1b-it --prompt "How do I learn Rust programming?"
```
## Performance Notes

- GPU acceleration is automatically detected and used when available
- BF16 precision is used on CUDA for better performance
- F32 precision is used on CPU
- Flash attention can be enabled with `--use-flash-attn` for supported models
- Model files are cached locally after the first download

An invocation that combines these options on a CUDA build is sketched below.
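This is an untested sketch assuming a CUDA build; whether BF16 and flash attention are actually used depends on the model and on the features the crate was compiled with:

```bash
# CUDA build with flash attention and explicit BF16 precision (illustrative values)
cargo run --release --features cuda -- \
  --use-flash-attn \
  --dtype bf16 \
  --model gemma-2-9b-it \
  --prompt "Describe the attention mechanism in transformers" \
  --max-tokens 120
```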
## Requirements

- Rust 1.70+
- CUDA toolkit (for CUDA support)
- Metal (automatically available on macOS)
- Internet connection for first-time model download