# Llama Runner
A fast Rust implementation for running Llama and other language models using the Candle deep learning framework. Built on the official Candle examples with optimizations for speed and usability.
## Features
- 🚀 **High Performance**: Metal GPU acceleration on macOS, CUDA support on Linux/Windows
- 🤖 **Multiple Models**: Supports Llama 3.2, SmolLM2, TinyLlama, and more
- ⚡ **Fast Inference**: Optimized with F16 precision and KV caching
- 🎯 **Advanced Sampling**: Top-k, top-p, temperature, and repeat penalty controls
- 📊 **Performance Metrics**: Real-time tokens/second reporting
- 🔧 **Easy CLI**: Simple command-line interface with sensible defaults
## Supported Models
| Model | Size | `--model` value | Description |
|-------|------|---------|-------------|
| SmolLM2-135M | 135M | `smollm2-135m` | Tiny, fast model for testing |
| SmolLM2-360M | 360M | `smollm2-360m` | Small, efficient model |
| SmolLM2-1.7B | 1.7B | `smollm2-1.7b` | Balanced performance/speed |
| Llama-3.2-1B | 1B | `llama-3.2-1b` | Meta's compact model |
| Llama-3.2-3B | 3B | `llama-3.2-3b` | Larger Llama model |
| TinyLlama-1.1B | 1.1B | `tinyllama-1.1b-chat` | Chat-optimized small model |
Add `-instruct` suffix for instruction-tuned variants (e.g., `smollm2-135m-instruct`).
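These names resolve to HuggingFace repositories at download time. A rough sketch of what such a lookup could look like (the repo IDs below are the public HuggingFace ones and are assumptions about what this project maps to; the real resolution logic may differ):
```rust
/// Hypothetical mapping from CLI model names to HuggingFace repo IDs;
/// the project's actual resolution may differ.
fn model_repo(name: &str) -> Option<&'static str> {
    match name {
        "smollm2-135m" => Some("HuggingFaceTB/SmolLM2-135M"),
        "smollm2-135m-instruct" => Some("HuggingFaceTB/SmolLM2-135M-Instruct"),
        "smollm2-360m" => Some("HuggingFaceTB/SmolLM2-360M"),
        "smollm2-1.7b" => Some("HuggingFaceTB/SmolLM2-1.7B"),
        "llama-3.2-1b" => Some("meta-llama/Llama-3.2-1B"),
        "llama-3.2-3b" => Some("meta-llama/Llama-3.2-3B"),
        "tinyllama-1.1b-chat" => Some("TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
        _ => None,
    }
}
```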
## Installation
```bash
# Clone the repository
git clone <repository-url>
cd llama-runner
# Build with GPU acceleration (recommended)
cargo build --release --features metal # macOS
cargo build --release --features cuda # Linux/Windows with NVIDIA GPU
# CPU-only build
cargo build --release
```
## Quick Start
```bash
# Fast inference with GPU acceleration
cargo run --features metal -- --prompt "What is quantum computing?"
# Specify a model and parameters
cargo run --features metal -- \
--prompt "Write a short story about space exploration" \
--model smollm2-360m \
--max-tokens 100 \
--temperature 0.8
# Use CPU (slower but works everywhere)
cargo run -- --prompt "Hello, world!" --model smollm2-135m --cpu
```
## Usage Examples
### Basic Text Generation
```bash
# Simple completion
cargo run --features metal -- --prompt "The capital of France is"
# Creative writing with higher temperature
cargo run --features metal -- \
--prompt "Once upon a time" \
--temperature 1.0 \
--max-tokens 200
```
### Advanced Sampling
```bash
# Top-k and top-p sampling
cargo run --features metal -- \
--prompt "Explain artificial intelligence" \
--top-k 40 \
--top-p 0.9 \
--temperature 0.7
# Reduce repetition
cargo run --features metal -- \
--prompt "List the benefits of renewable energy" \
--repeat-penalty 1.2 \
--repeat-last-n 64
```
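These flags correspond to the sampling utilities that ship with Candle. Below is a rough sketch of how temperature, top-p, and the repeat penalty could be wired together using `candle_transformers::generation::LogitsProcessor` and `candle_transformers::utils::apply_repeat_penalty`; this is an illustration, not the project's exact code.
```rust
use candle_core::Tensor;
use candle_transformers::generation::LogitsProcessor;
use candle_transformers::utils::apply_repeat_penalty;

/// Sketch of one decoding step: penalize recent tokens, then sample.
fn sample_next_token(
    logits: &Tensor,                       // raw logits for the last position
    tokens: &[u32],                        // tokens generated so far
    logits_processor: &mut LogitsProcessor, // e.g. LogitsProcessor::new(seed, Some(0.7), Some(0.9))
    repeat_penalty: f32,
    repeat_last_n: usize,
) -> candle_core::Result<u32> {
    // Down-weight tokens that appeared in the last `repeat_last_n` positions.
    let start = tokens.len().saturating_sub(repeat_last_n);
    let logits = apply_repeat_penalty(logits, repeat_penalty, &tokens[start..])?;
    // Temperature and top-p are applied inside the LogitsProcessor;
    // top-k would use Candle's `Sampling` variants in newer versions.
    logits_processor.sample(&logits)
}
```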
### Different Models
```bash
# Ultra-fast with tiny model
cargo run --features metal -- \
--prompt "Quick test" \
--model smollm2-135m
# Better quality with larger model
cargo run --features metal -- \
--prompt "Explain quantum physics" \
--model llama-3.2-1b \
--max-tokens 150
```
## Command-Line Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--prompt` | `-p` | "The capital of France is" | Input prompt |
| `--model` | `-m` | `smollm2-135m` | Model to use |
| `--max-tokens` | `-n` | 100 | Maximum tokens to generate |
| `--temperature` | `-t` | 0.8 | Sampling temperature (0.0 = deterministic) |
| `--top-k` | | None | Top-k sampling |
| `--top-p` | | None | Top-p (nucleus) sampling |
| `--seed` | | 299792458 | Random seed for reproducibility |
| `--repeat-penalty` | | 1.1 | Repetition penalty (1.0 = no penalty) |
| `--repeat-last-n` | | 128 | Context window for repeat penalty |
| `--cpu` | | false | Force CPU usage |
| `--dtype` | | f16 | Data type: f16, bf16, f32 |
| `--no-kv-cache` | | false | Disable key-value caching |
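The defaults in this table could be declared with a `clap` derive struct along these lines (a sketch matching the table, not necessarily the project's exact definition):
```rust
use clap::Parser;

/// Hypothetical CLI definition mirroring the options table above.
#[derive(Parser, Debug)]
struct Args {
    #[arg(short, long, default_value = "The capital of France is")]
    prompt: String,

    #[arg(short, long, default_value = "smollm2-135m")]
    model: String,

    #[arg(short = 'n', long, default_value_t = 100)]
    max_tokens: usize,

    #[arg(short, long, default_value_t = 0.8)]
    temperature: f64,

    #[arg(long)]
    top_k: Option<usize>,

    #[arg(long)]
    top_p: Option<f64>,

    #[arg(long, default_value_t = 299792458)]
    seed: u64,

    #[arg(long, default_value_t = 1.1)]
    repeat_penalty: f32,

    #[arg(long, default_value_t = 128)]
    repeat_last_n: usize,

    #[arg(long)]
    cpu: bool,

    #[arg(long, default_value = "f16")]
    dtype: String,

    #[arg(long)]
    no_kv_cache: bool,
}
```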
## Performance
Typical performance on Apple M2 with Metal acceleration:
| Model | Size | Speed | Memory |
|-------|------|-------|--------|
| SmolLM2-135M | 135M | ~100 tok/s | ~500MB |
| SmolLM2-360M | 360M | ~80 tok/s | ~1GB |
| SmolLM2-1.7B | 1.7B | ~50 tok/s | ~3GB |
| Llama-3.2-1B | 1B | ~40 tok/s | ~2GB |
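The tokens/second figures are simply the number of generated tokens divided by wall-clock time over the decode loop; a minimal sketch of such a report:
```rust
use std::time::Instant;

/// Sketch of throughput reporting: time the decode loop and print token/s.
fn report_throughput(generated_tokens: usize, start: Instant) {
    let elapsed = start.elapsed().as_secs_f64();
    println!(
        "{generated_tokens} tokens generated ({:.2} token/s)",
        generated_tokens as f64 / elapsed
    );
}
```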
## Requirements
- **Rust**: 1.70+ (latest stable recommended)
- **Memory**: 2-8GB RAM depending on model size
- **Storage**: 1-10GB for model weights
- **Network**: Internet connection for first-time model download
- **GPU** (optional): Metal on macOS, CUDA on Linux/Windows
## GPU Support
### macOS (Metal)
```bash
cargo run --features metal -- [options]
```
### Linux/Windows (CUDA)
```bash
cargo run --features cuda -- [options]
```
### CPU Only
```bash
cargo run -- --cpu [options]
```
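Internally, these feature flags just select a different `candle_core::Device`. A rough sketch of the selection logic, assuming Candle's standard device constructors (the project's actual code may differ):
```rust
use candle_core::Device;

/// Sketch of device selection: prefer Metal/CUDA when the corresponding
/// Cargo feature is enabled, otherwise fall back to the CPU.
fn select_device(force_cpu: bool) -> candle_core::Result<Device> {
    if force_cpu {
        return Ok(Device::Cpu);
    }
    #[cfg(feature = "metal")]
    return Device::new_metal(0);
    #[cfg(feature = "cuda")]
    return Device::new_cuda(0);
    #[cfg(not(any(feature = "metal", feature = "cuda")))]
    Ok(Device::Cpu)
}
```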
## Model Downloads
Models are automatically downloaded from HuggingFace Hub on first use and cached locally. Download times:
- SmolLM2-135M: ~1 minute
- SmolLM2-360M: ~2 minutes
- Llama-3.2-1B: ~5 minutes
- Larger models: 10+ minutes
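Downloads and caching go through the HuggingFace Hub cache under `~/.cache/huggingface/`. A minimal sketch of what a first-time fetch looks like with the `hf-hub` crate's sync API (the repo ID and file names are examples and vary by model):
```rust
use hf_hub::api::sync::Api;

fn main() -> anyhow::Result<()> {
    // Downloads the files on first use, then reuses the local cache.
    let api = Api::new()?;
    let repo = api.model("HuggingFaceTB/SmolLM2-135M".to_string());
    let tokenizer = repo.get("tokenizer.json")?;
    let weights = repo.get("model.safetensors")?;
    println!("tokenizer: {tokenizer:?}\nweights: {weights:?}");
    Ok(())
}
```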
## Troubleshooting
### Slow Performance
- Use `--features metal` on macOS or `--features cuda` on Linux/Windows
- Try smaller models like `smollm2-135m` for faster inference
- Ensure sufficient RAM for your chosen model
### Out of Memory
- Use `--cpu` to use system RAM instead of GPU memory
- Try smaller models or reduce `--max-tokens`
- Use `--dtype f32` if f16 causes issues
### Model Download Issues
- Check internet connection
- Some models may require HuggingFace Hub authentication
- Verify sufficient disk space in `~/.cache/huggingface/`
## Contributing
Contributions welcome! This project is based on the [Candle](https://github.com/huggingface/candle) framework by HuggingFace.
## License
MIT License - see LICENSE file for details.