Supports small Llama and Gemma models

Refactor inference

Dedicated crates for Llama and Gemma inference; not yet integrated
geoffsee
2025-08-29 18:15:29 -04:00
parent d06b16bb12
commit 315ef17605
26 changed files with 2136 additions and 1402 deletions

@@ -0,0 +1,137 @@
# Gemma Runner
Fast Gemma inference with the Candle framework in Rust.
## Features
- Support for multiple Gemma model versions (v1, v2, v3)
- GPU acceleration with CUDA and Metal
- Configurable sampling parameters
- Multiple model variants, including instruct and code models
## Supported Models
### Gemma v1
- `gemma-2b` - Base 2B model
- `gemma-7b` - Base 7B model
- `gemma-2b-it` - Instruct 2B model
- `gemma-7b-it` - Instruct 7B model
- `gemma-1.1-2b-it` - Instruct 2B v1.1 model
- `gemma-1.1-7b-it` - Instruct 7B v1.1 model
### CodeGemma
- `codegemma-2b` - Code base 2B model
- `codegemma-7b` - Code base 7B model
- `codegemma-2b-it` - Code instruct 2B model
- `codegemma-7b-it` - Code instruct 7B model
### Gemma v2
- `gemma-2-2b` - Base 2B v2 model (default)
- `gemma-2-2b-it` - Instruct 2B v2 model
- `gemma-2-9b` - Base 9B v2 model
- `gemma-2-9b-it` - Instruct 9B v2 model
### Gemma v3
- `gemma-3-1b` - Base 1B v3 model
- `gemma-3-1b-it` - Instruct 1B v3 model
## Installation
```bash
cd gemma-runner
cargo build --release
```
For GPU support:
```bash
# CUDA
cargo build --release --features cuda
# Metal (macOS)
cargo build --release --features metal
```
## Usage
### Basic Usage
```bash
# Run with default model (gemma-2-2b)
cargo run -- --prompt "The capital of France is"
# Specify a different model
cargo run -- --model gemma-2b-it --prompt "Explain quantum computing"
# Generate more tokens
cargo run -- --model codegemma-2b-it --prompt "Write a Python function to sort a list" --max-tokens 200
```
### Advanced Options
```bash
# Use CPU instead of GPU
cargo run -- --cpu --prompt "Hello world"
# Adjust sampling parameters
cargo run -- --temperature 0.8 --top-p 0.9 --prompt "Write a story about"
# Use custom model from HuggingFace Hub
cargo run -- --model-id "google/gemma-2-2b-it" --prompt "What is AI?"
# Enable tracing for performance analysis
cargo run -- --tracing --prompt "Explain machine learning"
```
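Under the hood, Candle-based runners typically route these knobs through `candle_transformers::generation::LogitsProcessor` and `candle_transformers::utils::apply_repeat_penalty`. The sketch below shows one plausible sampling step under that assumption; the actual code in this crate may be organized differently.
```rust
use candle_core::Tensor;
use candle_transformers::generation::LogitsProcessor;
use candle_transformers::utils::apply_repeat_penalty;

/// Sample the next token id from raw logits, first penalizing tokens
/// that appeared in the last `repeat_last_n` generated ids.
fn next_token(
    logits: &Tensor,
    tokens: &[u32],
    // e.g. LogitsProcessor::new(299792458, Some(0.8), Some(0.9))
    sampler: &mut LogitsProcessor,
    repeat_penalty: f32,
    repeat_last_n: usize,
) -> candle_core::Result<u32> {
    let start = tokens.len().saturating_sub(repeat_last_n);
    let logits = apply_repeat_penalty(logits, repeat_penalty, &tokens[start..])?;
    sampler.sample(&logits)
}
```
`LogitsProcessor::new(seed, temperature, top_p)` falls back to greedy argmax when no temperature is given, which is why `--temperature` and `--top-p` can be left unset.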
### Command Line Arguments
- `--prompt, -p` - The prompt to generate text from (default: "The capital of France is")
- `--model, -m` - The model to use (default: "gemma-2-2b")
- `--cpu` - Run on CPU rather than GPU
- `--temperature, -t` - Sampling temperature (optional)
- `--top-p` - Nucleus sampling probability cutoff (optional)
- `--seed` - Random seed (default: 299792458)
- `--max-tokens, -n` - Maximum tokens to generate (default: 100)
- `--model-id` - Custom model ID from HuggingFace Hub
- `--revision` - Model revision (default: "main")
- `--use-flash-attn` - Use flash attention
- `--repeat-penalty` - Repetition penalty (default: 1.1)
- `--repeat-last-n` - Context size for repeat penalty (default: 64)
- `--dtype` - Data type (f16, bf16, f32)
- `--tracing` - Enable performance tracing
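These flags map naturally onto a `clap` derive struct. The following is a hypothetical mirror of the list above (field names and defaults copied from it); the crate's real argument struct may differ.
```rust
use clap::Parser;

/// Hypothetical CLI definition mirroring the documented flags.
#[derive(Parser, Debug)]
struct Args {
    /// The prompt to generate text from
    #[arg(short, long, default_value = "The capital of France is")]
    prompt: String,

    /// The model to use
    #[arg(short, long, default_value = "gemma-2-2b")]
    model: String,

    /// Run on CPU rather than GPU
    #[arg(long)]
    cpu: bool,

    /// Sampling temperature
    #[arg(short, long)]
    temperature: Option<f64>,

    /// Nucleus sampling probability cutoff
    #[arg(long)]
    top_p: Option<f64>,

    /// Random seed
    #[arg(long, default_value_t = 299792458)]
    seed: u64,

    /// Maximum tokens to generate
    #[arg(short = 'n', long, default_value_t = 100)]
    max_tokens: usize,

    /// Custom model ID from the HuggingFace Hub
    #[arg(long)]
    model_id: Option<String>,

    /// Model revision
    #[arg(long, default_value = "main")]
    revision: String,

    /// Use flash attention
    #[arg(long)]
    use_flash_attn: bool,

    /// Repetition penalty
    #[arg(long, default_value_t = 1.1)]
    repeat_penalty: f32,

    /// Context size for the repeat penalty
    #[arg(long, default_value_t = 64)]
    repeat_last_n: usize,

    /// Data type: f16, bf16, or f32
    #[arg(long)]
    dtype: Option<String>,

    /// Enable performance tracing
    #[arg(long)]
    tracing: bool,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```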
## Examples
### Text Generation
```bash
cargo run -- --model gemma-2b-it --prompt "Explain the theory of relativity" --max-tokens 150
```
### Code Generation
```bash
cargo run -- --model codegemma-7b-it --prompt "Write a Rust function to calculate factorial" --max-tokens 100
```
### Creative Writing
```bash
cargo run -- --model gemma-7b-it --temperature 0.9 --prompt "Once upon a time in a magical forest" --max-tokens 200
```
### Chat with Gemma 3 (Instruct format)
```bash
cargo run -- --model gemma-3-1b-it --prompt "How do I learn Rust programming?"
```
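Gemma instruct (`-it`) checkpoints are trained on a turn-based chat template using `<start_of_turn>` / `<end_of_turn>` markers. Whether this runner applies the template automatically is an assumption; if it does not, the prompt can be wrapped by hand, roughly like this:
```rust
/// Wrap a user message in Gemma's chat template so an instruct model
/// responds as the "model" turn. Hypothetical helper, not part of the crate.
fn gemma_chat_prompt(user_msg: &str) -> String {
    format!("<start_of_turn>user\n{user_msg}<end_of_turn>\n<start_of_turn>model\n")
}
```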
## Performance Notes
- GPU acceleration is automatically detected and used when available
- BF16 precision is used on CUDA for better performance
- F32 precision is used on CPU
- Flash attention can be enabled with `--use-flash-attn` for supported models
- Model files are cached locally after first download
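The dtype policy above could be implemented with standard `candle_core` device probing along these lines. This is a sketch only, gated on the same `cuda`/`metal` Cargo features used in the Installation section; the crate's actual selection logic may differ.
```rust
use candle_core::{DType, Device};

/// Pick a device and matching dtype: BF16 on CUDA, F32 on CPU.
/// The Metal dtype is an assumption; the notes above only pin CUDA and CPU.
fn device_and_dtype(force_cpu: bool) -> candle_core::Result<(Device, DType)> {
    if force_cpu {
        return Ok((Device::Cpu, DType::F32));
    }
    #[cfg(feature = "cuda")]
    {
        if let Ok(dev) = Device::new_cuda(0) {
            return Ok((dev, DType::BF16));
        }
    }
    #[cfg(feature = "metal")]
    {
        if let Ok(dev) = Device::new_metal(0) {
            return Ok((dev, DType::F32));
        }
    }
    Ok((Device::Cpu, DType::F32))
}
```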
## Requirements
- Rust 1.70+
- CUDA toolkit (for CUDA support)
- Metal (automatically available on macOS)
- Internet connection for first-time model download