predict-otron-9001/crates/legacy-inference-engine/ROOT_CAUSE_ANALYSIS.md

# Root Cause Analysis: Metal error "no metal implementation for rotary-emb"

Date: 2025-08-27
Component: crates/legacy-inference-engine
Command to reproduce: crates/legacy-inference-engine/test_cli.sh

## Summary
Running the CLI with the default model (--which 3-1b-it, i.e., Gemma 3 1B Instruct) on an Apple Silicon Mac results in a runtime failure:

```
modelError: Metal error no metal implementation for rotary-emb

Caused by:
    no metal implementation for rotary-emb
```

This occurs because the project targets the Candle Metal (MPS) backend on macOS, but the Candle version in use (0.9.1) does not provide a Metal kernel implementation for the rotary embedding operation required by Gemma 3 models. The program selects the Metal device by default on macOS and hits this missing kernel during the attention computation.

## Environment and build configuration
- Machine: 2024 MacBook Pro, Apple Silicon (M4 Max)
- Crate: legacy-inference-engine
- Candle versions: pinned to =0.9.1
  - candle-core = "=0.9.1"
  - candle-transformers = "=0.9.1"
- macOS-specific dependency enabling Metal (file: crates/legacy-inference-engine/Cargo.toml):

```text
[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "=0.9.1", features = ["metal"] }
metal = { version = "0.32.0", features = ["mps"] }
```

- Run command (attached script): crates/legacy-inference-engine/test_cli.sh

```text
cargo run -p legacy-inference-engine --release -- --prompt 'Name the 16th President of the USA.' --which 3-1b-it
```

## What the code does at runtime
1) Device selection (defaults to Metal on macOS if available):
- File: crates/legacy-inference-engine/src/utilities_lib.rs (lines 4–12)

```text
pub fn device(cpu: bool) -> Result<Device> {
    if cpu {
        Ok(Device::Cpu)
    } else if cuda_is_available() {
        Ok(Device::new_cuda(0)?)
    } else if metal_is_available() {
        Ok(Device::new_metal(0)?)
    } else {
        // ... falls back to CPU
        Ok(Device::Cpu)
    }
}
```

- The CLI does not pass --cpu, so on Apple Silicon with Metal available, Device::new_metal(0) is selected.

2) Default model selection is Gemma 3 1B Instruct:
- File: crates/legacy-inference-engine/src/main.rs
- Arg default (lines 705–707):

```text
/// The model to use.
#[arg(long, default_value = "3-1b-it")]
which: Which,
```

- Model id resolution (lines 758–760):

```text
Which::BaseV3_1B => "google/gemma-3-1b-pt".to_string(),
Which::InstructV3_1B => "google/gemma-3-1b-it".to_string(),
```

- Model loading uses Model3 (Gemma 3) for Which::BaseV3_1B | Which::InstructV3_1B (lines 817–821).

3) During generation, the Gemma 3 attention path requires rotary embeddings. On the Metal backend in Candle 0.9.1, the rotary embedding op is not implemented, resulting in the runtime error.

## Additional build-time signal (misleading but not causal)
- File: crates/legacy-inference-engine/src/main.rs (lines 10–11)

```text
#[cfg(feature = "metal")]
extern crate metal_src;
```

- Build warning: unexpected cfg condition value: metal
Explanation: The project does not define a Cargo feature named "metal"; instead, Metal is enabled via target-specific dependency features in Cargo.toml. This cfg gate is ineffective and triggers a warning. It does not cause the runtime failure; it just indicates confusing/obsolete gating.

## Root cause
- The program runs on the Candle Metal backend (MPS) due to device auto-selection on macOS.
- The selected model (Gemma 3 1B Instruct) requires the rotary embedding operation in its attention mechanism.
- Candle 0.9.1’s Metal backend lacks an implementation for the rotary-emb kernel. When the model executes on Metal, it attempts to invoke this operation and fails with: "no metal implementation for rotary-emb".

## Evidence
- Runtime log shows the failure immediately after model load when inference begins.
- Code paths confirm: device defaults to Metal on macOS; default model is Gemma 3; Gemma 3 uses rotary embeddings.
- Candle version pinned to 0.9.1 where rotary-emb on Metal is not available.

## Impact
- Any attempt to run Gemma 3 (and possibly other rotary-embedding reliant models) on the Metal backend with Candle 0.9.1 will fail at runtime on macOS.

## Workarounds and remediation options
1) Immediate workarounds:
- Run on CPU: add the --cpu flag to force CPU backend.
  - Example: cargo run -p legacy-inference-engine --release -- --cpu --prompt '...' --which 3-1b-it
- Use a model variant that does not hit the unimplemented kernel on Metal (e.g., older Gemma v1/v2), though many modern LLMs rely on rotary embeddings, so this may not help.

2) Recommended remediation (code/dependency changes):
- Upgrade Candle crates (candle-core, candle-transformers, etc.) to a version where the Metal backend implements rotary embeddings. Review Candle’s changelog/PRs for Metal/MPS kernel support and update to the first version that includes rotary-emb on Metal.
- Alternatively, implement a CPU fallback path for rotary-emb when running on Metal (hybrid execution). This is non-trivial and may degrade performance.
- Provide a configuration/flag to disable Metal by default on macOS for models known to require missing ops until Candle is upgraded.
- Clean up the misleading #[cfg(feature = "metal")] gate in main.rs to avoid confusion; Metal enablement is already handled in Cargo.toml via target-specific features.

## Suggested next steps
- Short term: document and expose --cpu usage in README and/or make the default model a Metal-compatible one until dependency upgrade.
- Medium term: bump Candle dependencies and test Gemma 3 on Metal; remove the obsolete cfg(feature = "metal") gate.
- Long term: integrate a device capability check and automatic fallback (informative log) when encountering unsupported kernels on the selected backend.

## References (code locations)
- crates/legacy-inference-engine/src/utilities_lib.rs lines 4–12: device selection (Metal default on macOS if available).
- crates/legacy-inference-engine/src/main.rs lines 705–707: default which = 3-1b-it.
- crates/legacy-inference-engine/src/main.rs lines 758–760 and 817–821: Gemma 3 model selection and instantiation.
- crates/legacy-inference-engine/Cargo.toml macOS target section: Candle with features = ["metal"].
- crates/legacy-inference-engine/src/main.rs lines 10–11: obsolete #[cfg(feature = "metal")] gate that triggers a warning.