# Root Cause Analysis: Metal error "no metal implementation for rotary-emb"
Date: 2025-08-27
Component: crates/legacy-inference-engine
Command to reproduce: crates/legacy-inference-engine/test_cli.sh
## Summary
Running the CLI with the default model (--which 3-1b-it, i.e., Gemma 3 1B Instruct) on an Apple Silicon Mac results in a runtime failure:
```
modelError: Metal error no metal implementation for rotary-emb
Caused by:
no metal implementation for rotary-emb
```
This occurs because the project targets the Candle Metal (MPS) backend on macOS, but the Candle version in use (0.9.1) does not provide a Metal kernel implementation for the rotary embedding operation required by Gemma 3 models. The program selects the Metal device by default on macOS and hits this missing kernel during the attention computation.
## Environment and build configuration
- Machine: 2024 MacBook Pro, Apple Silicon (M4 Max)
- Crate: legacy-inference-engine
- Candle versions: pinned to =0.9.1
  - candle-core = "=0.9.1"
  - candle-transformers = "=0.9.1"
- macOS-specific dependency enabling Metal (file: crates/legacy-inference-engine/Cargo.toml):
```text
[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "=0.9.1", features = ["metal"] }
metal = { version = "0.32.0", features = ["mps"] }
```
- Run command (attached script): crates/legacy-inference-engine/test_cli.sh
```text
cargo run -p legacy-inference-engine --release -- --prompt 'Name the 16th President of the USA.' --which 3-1b-it
```
## What the code does at runtime
1) Device selection (defaults to Metal on macOS if available):
- File: crates/legacy-inference-engine/src/utilities_lib.rs (lines 4-12)
```text
pub fn device(cpu: bool) -> Result<Device> {
    if cpu {
        Ok(Device::Cpu)
    } else if cuda_is_available() {
        Ok(Device::new_cuda(0)?)
    } else if metal_is_available() {
        Ok(Device::new_metal(0)?)
    } else {
        // ... falls back to CPU
        Ok(Device::Cpu)
    }
}
```
- The CLI does not pass --cpu, so on Apple Silicon with Metal available, Device::new_metal(0) is selected.
2) Default model selection is Gemma 3 1B Instruct:
- File: crates/legacy-inference-engine/src/main.rs
- Arg default (lines 705-707):
```text
/// The model to use.
#[arg(long, default_value = "3-1b-it")]
which: Which,
```
- Model id resolution (lines 758-760):
```text
Which::BaseV3_1B => "google/gemma-3-1b-pt".to_string(),
Which::InstructV3_1B => "google/gemma-3-1b-it".to_string(),
```
- Model loading uses Model3 (Gemma 3) for Which::BaseV3_1B | Which::InstructV3_1B (lines 817-821).
3) During generation, the Gemma 3 attention path requires rotary embeddings. On the Metal backend in Candle 0.9.1, the rotary embedding op is not implemented, resulting in the runtime error.
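
The chain of events above can be modeled in a few lines of standard-library Rust. This is only a sketch: `Backend`, `select_backend`, `supports_rotary_emb`, and `simulate_run` are hypothetical names, the availability flags are passed in as plain booleans instead of calling Candle's `cuda_is_available()` / `metal_is_available()`, and the support table encodes nothing beyond what this report observed (CUDA and CPU are assumed to work).

```rust
/// Simplified stand-in for Candle's `Device`; illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cpu,
    Cuda,
    Metal,
}

/// Mirrors the priority order of `device(cpu)` quoted above:
/// explicit --cpu first, then CUDA, then Metal, then CPU.
/// Availability is passed in so the logic runs without any GPU present.
pub fn select_backend(cpu_flag: bool, cuda_available: bool, metal_available: bool) -> Backend {
    if cpu_flag {
        Backend::Cpu
    } else if cuda_available {
        Backend::Cuda
    } else if metal_available {
        Backend::Metal
    } else {
        Backend::Cpu
    }
}

/// Assumption: only Metal lacks the rotary embedding kernel,
/// which is the single data point this report provides.
pub fn supports_rotary_emb(backend: Backend) -> bool {
    !matches!(backend, Backend::Metal)
}

/// Selects a backend, then "runs" a model that needs rotary embeddings.
pub fn simulate_run(
    cpu_flag: bool,
    cuda_available: bool,
    metal_available: bool,
) -> Result<Backend, String> {
    let backend = select_backend(cpu_flag, cuda_available, metal_available);
    if supports_rotary_emb(backend) {
        Ok(backend)
    } else {
        Err("no metal implementation for rotary-emb".to_string())
    }
}
```

With `cpu_flag = false` and only Metal available (the Apple Silicon case), `simulate_run` returns the error; passing `cpu_flag = true` yields `Ok(Backend::Cpu)`, which corresponds to the --cpu workaround.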
## Additional build-time signal (misleading but not causal)
- File: crates/legacy-inference-engine/src/main.rs (lines 10-11)
```text
#[cfg(feature = "metal")]
extern crate metal_src;
```
- Build warning: unexpected cfg condition value: metal
Explanation: The project does not define a Cargo feature named "metal"; instead, Metal is enabled via target-specific dependency features in Cargo.toml. This cfg gate is ineffective and triggers a warning. It does not cause the runtime failure; it just indicates confusing/obsolete gating.
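
If a compile-time gate is still wanted, one option is to key it off the target OS, which is the same condition Cargo.toml uses to enable Candle's Metal feature. The sketch below is an assumption about what such a gate could look like, not the project's code; `built_for_metal_target` and `backend_hint` are hypothetical names, and it does not replace whatever `metal_src` was originally meant to provide.

```rust
/// Reports whether this build targets macOS, the same condition under
/// which Cargo.toml pulls in candle-core with features = ["metal"].
pub fn built_for_metal_target() -> bool {
    cfg!(target_os = "macos")
}

/// `target_os` is a built-in cfg key, so gating on it does not produce
/// the "unexpected cfg condition value" warning that an undeclared
/// Cargo feature name does.
#[cfg(target_os = "macos")]
pub fn backend_hint() -> &'static str {
    "metal dependency compiled in"
}

#[cfg(not(target_os = "macos"))]
pub fn backend_hint() -> &'static str {
    "metal dependency not compiled in"
}
```

The alternative is to declare a real `metal` feature in the crate's Cargo.toml so that `#[cfg(feature = "metal")]` becomes meaningful; which of the two fits better depends on whether Metal should be opt-in or automatic on macOS.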
## Root cause
- The program runs on the Candle Metal backend (MPS) due to device auto-selection on macOS.
- The selected model (Gemma 3 1B Instruct) requires the rotary embedding operation in its attention mechanism.
- Candle 0.9.1's Metal backend lacks an implementation for the rotary-emb kernel. When the model executes on Metal, it attempts to invoke this operation and fails with: "no metal implementation for rotary-emb".
## Evidence
- Runtime log shows the failure immediately after model load when inference begins.
- Code paths confirm: device defaults to Metal on macOS; default model is Gemma 3; Gemma 3 uses rotary embeddings.
- Candle version pinned to 0.9.1 where rotary-emb on Metal is not available.
## Impact
- Any attempt to run Gemma 3 (and possibly other models that rely on rotary embeddings) on the Metal backend with Candle 0.9.1 will fail at runtime on macOS.
## Workarounds and remediation options
1) Immediate workarounds:
- Run on CPU: add the --cpu flag to force the CPU backend.
  - Example: cargo run -p legacy-inference-engine --release -- --cpu --prompt '...' --which 3-1b-it
- Use a model variant that does not hit the unimplemented kernel on Metal (e.g., older Gemma v1/v2), though many modern LLMs rely on rotary embeddings, so this may not help.
2) Recommended remediation (code/dependency changes):
- Upgrade Candle crates (candle-core, candle-transformers, etc.) to a version where the Metal backend implements rotary embeddings. Review Candle's changelog/PRs for Metal/MPS kernel support and update to the first version that includes rotary-emb on Metal.
- Alternatively, implement a CPU fallback path for rotary-emb when running on Metal (hybrid execution). This is non-trivial and may degrade performance.
- Provide a configuration/flag to disable Metal by default on macOS for models known to require missing ops until Candle is upgraded.
- Clean up the misleading #[cfg(feature = "metal")] gate in main.rs to avoid confusion; Metal enablement is already handled in Cargo.toml via target-specific features.
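
A coarser variant of the CPU fallback idea is to retry the whole operation on the CPU when the primary backend reports a missing kernel, rather than falling back per op. The following is a minimal sketch under stated assumptions: `Backend`, `is_missing_kernel_error`, and `run_with_cpu_fallback` are hypothetical names, errors are plain strings, and detection relies on matching the message text seen in this report. A real engine would also have to load or move the model weights onto the CPU device before retrying.

```rust
/// Simplified stand-in for Candle's `Device`; illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cpu,
    Metal,
}

/// Returns true when an error message looks like a missing-kernel error.
/// Matching on message text is an assumption of this sketch, modeled on
/// the string reported above ("no metal implementation for rotary-emb").
pub fn is_missing_kernel_error(message: &str) -> bool {
    message.contains("no metal implementation for")
}

/// Runs `op` on `primary`. If that fails with a missing-kernel error,
/// retries once on the CPU. Any other error is returned unchanged.
/// On success, returns the output and the backend that produced it.
pub fn run_with_cpu_fallback<T, F>(primary: Backend, mut op: F) -> Result<(T, Backend), String>
where
    F: FnMut(Backend) -> Result<T, String>,
{
    match op(primary) {
        Ok(value) => Ok((value, primary)),
        Err(e) if primary != Backend::Cpu && is_missing_kernel_error(&e) => {
            // Informative log so the slower path is visible to the user.
            eprintln!("falling back to CPU: {}", e);
            op(Backend::Cpu).map(|value| (value, Backend::Cpu))
        }
        Err(e) => Err(e),
    }
}
```

Returning the backend alongside the result lets the caller report which device actually ran, which matters because the CPU path may be much slower than the user expects.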
## Suggested next steps
- Short term: document and expose --cpu usage in README and/or make the default model a Metal-compatible one until dependency upgrade.
- Medium term: bump Candle dependencies and test Gemma 3 on Metal; remove the obsolete cfg(feature = "metal") gate.
- Long term: integrate a device capability check and automatic fallback (informative log) when encountering unsupported kernels on the selected backend.
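
The long-term capability check could take the form of a pre-flight lookup performed before the model is loaded. The sketch below is hypothetical throughout: the names `unsupported_ops`, `required_ops`, and `choose_backend`, the "gemma3" family key, and the table contents are illustrative, with the only grounded entry being rotary-emb on Metal from this report.

```rust
use std::collections::HashSet;

/// Simplified stand-in for Candle's `Device`; illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cpu,
    Metal,
}

/// Ops a backend is known NOT to implement. Only "rotary-emb" on Metal
/// comes from this report; a real table would be maintained per Candle version.
pub fn unsupported_ops(backend: Backend) -> HashSet<&'static str> {
    match backend {
        Backend::Metal => HashSet::from(["rotary-emb"]),
        Backend::Cpu => HashSet::new(),
    }
}

/// Ops a model family needs. Family names and op lists are illustrative.
pub fn required_ops(model_family: &str) -> Vec<&'static str> {
    match model_family {
        "gemma3" => vec!["rotary-emb"],
        _ => Vec::new(),
    }
}

/// Keeps the preferred backend unless the model needs an op it lacks;
/// in that case returns CPU plus a log line naming the missing ops.
pub fn choose_backend(preferred: Backend, model_family: &str) -> (Backend, Option<String>) {
    let unsupported = unsupported_ops(preferred);
    let missing: Vec<&str> = required_ops(model_family)
        .into_iter()
        .filter(|op| unsupported.contains(op))
        .collect();
    if missing.is_empty() {
        (preferred, None)
    } else {
        let note = format!(
            "{:?} lacks [{}]; falling back to CPU",
            preferred,
            missing.join(", ")
        );
        (Backend::Cpu, Some(note))
    }
}
```

For example, `choose_backend(Backend::Metal, "gemma3")` returns the CPU backend together with a message naming rotary-emb, while an unknown family keeps Metal with no message. A static table like this goes stale when dependencies change, so it would need to be revisited on every Candle upgrade.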
## References (code locations)
- crates/legacy-inference-engine/src/utilities_lib.rs lines 4-12: device selection (Metal default on macOS if available).
- crates/legacy-inference-engine/src/main.rs lines 705-707: default which = 3-1b-it.
- crates/legacy-inference-engine/src/main.rs lines 758-760 and 817-821: Gemma 3 model selection and instantiation.
- crates/legacy-inference-engine/Cargo.toml macOS target section: Candle with features = ["metal"].
- crates/legacy-inference-engine/src/main.rs lines 10-11: obsolete #[cfg(feature = "metal")] gate that triggers a warning.