Files
predict-otron-9001/crates/legacy-inference-engine/ROOT_CAUSE_ANALYSIS.md
geoffsee 8338750beb Refactor apply_cached_repeat_penalty for optimized caching and reuse, add extensive unit tests, and integrate special handling for gemma-specific models.
Removed `test_request.sh`, deprecated functionality, and unused imports; introduced a new CLI tool (`cli.ts`) for testing inference engine and adjusted handling of non-streaming/streaming chat completions.

- Add CPU fallback support for text generation when primary device is unsupported
- Introduce `execute_with_fallback` method to handle device compatibility and shape mismatch errors
- Extend unit tests to reproduce tensor shape mismatch errors specific to model configurations
- Increase HTTP timeout limits in `curl_chat_stream.sh` script for reliable API testing

chat completion endpoint functions with gemma3 (no streaming)

Add benchmarking guide with HTML reporting, Leptos chat crate, and middleware for metrics tracking
2025-08-27 16:15:01 -04:00

6.3 KiB
Raw Blame History

Root Cause Analysis: Metal error "no metal implementation for rotary-emb"

Date: 2025-08-27 Component: crates/legacy-inference-engine Command to reproduce: crates/legacy-inference-engine/test_cli.sh

Summary

Running the CLI with the default model (--which 3-1b-it, i.e., Gemma 3 1B Instruct) on an Apple Silicon Mac results in a runtime failure:

modelError: Metal error no metal implementation for rotary-emb

Caused by:
    no metal implementation for rotary-emb

This occurs because the project targets the Candle Metal (MPS) backend on macOS, but the Candle version in use (0.9.1) does not provide a Metal kernel implementation for the rotary embedding operation required by Gemma 3 models. The program selects the Metal device by default on macOS and hits this missing kernel during the attention computation.

Environment and build configuration

  • Machine: 2024 MacBook Pro, Apple Silicon (M4 Max)
  • Crate: legacy-inference-engine
  • Candle versions: pinned to =0.9.1
    • candle-core = "=0.9.1"
    • candle-transformers = "=0.9.1"
  • macOS-specific dependency enabling Metal (file: crates/legacy-inference-engine/Cargo.toml):
[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "=0.9.1", features = ["metal"] }
metal = { version = "0.32.0", features = ["mps"] }
  • Run command (attached script): crates/legacy-inference-engine/test_cli.sh
cargo run -p legacy-inference-engine --release -- --prompt 'Name the 16th President of the USA.' --which 3-1b-it

What the code does at runtime

  1. Device selection (defaults to Metal on macOS if available):
  • File: crates/legacy-inference-engine/src/utilities_lib.rs (lines 412)
pub fn device(cpu: bool) -> Result<Device> {
    if cpu {
        Ok(Device::Cpu)
    } else if cuda_is_available() {
        Ok(Device::new_cuda(0)?)
    } else if metal_is_available() {
        Ok(Device::new_metal(0)?)
    } else {
        // ... falls back to CPU
        Ok(Device::Cpu)
    }
}
  • The CLI does not pass --cpu, so on Apple Silicon with Metal available, Device::new_metal(0) is selected.
  1. Default model selection is Gemma 3 1B Instruct:
  • File: crates/legacy-inference-engine/src/main.rs
  • Arg default (lines 705707):
/// The model to use.
#[arg(long, default_value = "3-1b-it")]
which: Which,
  • Model id resolution (lines 758760):
Which::BaseV3_1B => "google/gemma-3-1b-pt".to_string(),
Which::InstructV3_1B => "google/gemma-3-1b-it".to_string(),
  • Model loading uses Model3 (Gemma 3) for Which::BaseV3_1B | Which::InstructV3_1B (lines 817821).
  1. During generation, the Gemma 3 attention path requires rotary embeddings. On the Metal backend in Candle 0.9.1, the rotary embedding op is not implemented, resulting in the runtime error.

Additional build-time signal (misleading but not causal)

  • File: crates/legacy-inference-engine/src/main.rs (lines 1011)
#[cfg(feature = "metal")]
extern crate metal_src;
  • Build warning: unexpected cfg condition value: metal Explanation: The project does not define a Cargo feature named "metal"; instead, Metal is enabled via target-specific dependency features in Cargo.toml. This cfg gate is ineffective and triggers a warning. It does not cause the runtime failure; it just indicates confusing/obsolete gating.

Root cause

  • The program runs on the Candle Metal backend (MPS) due to device auto-selection on macOS.
  • The selected model (Gemma 3 1B Instruct) requires the rotary embedding operation in its attention mechanism.
  • Candle 0.9.1s Metal backend lacks an implementation for the rotary-emb kernel. When the model executes on Metal, it attempts to invoke this operation and fails with: "no metal implementation for rotary-emb".

Evidence

  • Runtime log shows the failure immediately after model load when inference begins.
  • Code paths confirm: device defaults to Metal on macOS; default model is Gemma 3; Gemma 3 uses rotary embeddings.
  • Candle version pinned to 0.9.1 where rotary-emb on Metal is not available.

Impact

  • Any attempt to run Gemma 3 (and possibly other rotary-embedding reliant models) on the Metal backend with Candle 0.9.1 will fail at runtime on macOS.

Workarounds and remediation options

  1. Immediate workarounds:
  • Run on CPU: add the --cpu flag to force CPU backend.
    • Example: cargo run -p legacy-inference-engine --release -- --cpu --prompt '...' --which 3-1b-it
  • Use a model variant that does not hit the unimplemented kernel on Metal (e.g., older Gemma v1/v2), though many modern LLMs rely on rotary embeddings, so this may not help.
  1. Recommended remediation (code/dependency changes):
  • Upgrade Candle crates (candle-core, candle-transformers, etc.) to a version where the Metal backend implements rotary embeddings. Review Candles changelog/PRs for Metal/MPS kernel support and update to the first version that includes rotary-emb on Metal.
  • Alternatively, implement a CPU fallback path for rotary-emb when running on Metal (hybrid execution). This is non-trivial and may degrade performance.
  • Provide a configuration/flag to disable Metal by default on macOS for models known to require missing ops until Candle is upgraded.
  • Clean up the misleading #[cfg(feature = "metal")] gate in main.rs to avoid confusion; Metal enablement is already handled in Cargo.toml via target-specific features.

Suggested next steps

  • Short term: document and expose --cpu usage in README and/or make the default model a Metal-compatible one until dependency upgrade.
  • Medium term: bump Candle dependencies and test Gemma 3 on Metal; remove the obsolete cfg(feature = "metal") gate.
  • Long term: integrate a device capability check and automatic fallback (informative log) when encountering unsupported kernels on the selected backend.

References (code locations)

  • crates/legacy-inference-engine/src/utilities_lib.rs lines 412: device selection (Metal default on macOS if available).
  • crates/legacy-inference-engine/src/main.rs lines 705707: default which = 3-1b-it.
  • crates/legacy-inference-engine/src/main.rs lines 758760 and 817821: Gemma 3 model selection and instantiation.
  • crates/legacy-inference-engine/Cargo.toml macOS target section: Candle with features = ["metal"].
  • crates/legacy-inference-engine/src/main.rs lines 1011: obsolete #[cfg(feature = "metal")] gate that triggers a warning.