mirror of https://github.com/geoffsee/predict-otron-9001.git synced 2025-09-08 22:46:44 +00:00

Root Cause Analysis: Metal error "no metal implementation for rotary-emb"

Date: 2025-08-27
Component: crates/legacy-inference-engine
Command to reproduce: crates/legacy-inference-engine/test_cli.sh

Summary

Running the CLI with the default model (--which 3-1b-it, i.e., Gemma 3 1B Instruct) on an Apple Silicon Mac results in a runtime failure:

modelError: Metal error no metal implementation for rotary-emb

Caused by:
    no metal implementation for rotary-emb

This occurs because the project targets the Candle Metal (MPS) backend on macOS, but the Candle version in use (0.9.1) does not provide a Metal kernel implementation for the rotary embedding operation required by Gemma 3 models. The program selects the Metal device by default on macOS and hits this missing kernel during the attention computation.

Environment and build configuration

Machine: 2024 MacBook Pro, Apple Silicon (M4 Max)
Crate: legacy-inference-engine
Candle versions: pinned to =0.9.1
- candle-core = "=0.9.1"
- candle-transformers = "=0.9.1"
macOS-specific dependency enabling Metal (file: crates/legacy-inference-engine/Cargo.toml):

[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "=0.9.1", features = ["metal"] }
metal = { version = "0.32.0", features = ["mps"] }

Run command (attached script): crates/legacy-inference-engine/test_cli.sh

cargo run -p legacy-inference-engine --release -- --prompt 'Name the 16th President of the USA.' --which 3-1b-it

What the code does at runtime

Device selection (defaults to Metal on macOS if available):

File: crates/legacy-inference-engine/src/utilities_lib.rs (lines 4–12)

pub fn device(cpu: bool) -> Result<Device> {
    if cpu {
        Ok(Device::Cpu)
    } else if cuda_is_available() {
        Ok(Device::new_cuda(0)?)
    } else if metal_is_available() {
        Ok(Device::new_metal(0)?)
    } else {
        // ... falls back to CPU
        Ok(Device::Cpu)
    }
}

The CLI does not pass --cpu, so on Apple Silicon with Metal available, Device::new_metal(0) is selected.

Default model selection is Gemma 3 1B Instruct:

File: crates/legacy-inference-engine/src/main.rs
Arg default (lines 705–707):

/// The model to use.
#[arg(long, default_value = "3-1b-it")]
which: Which,

Model id resolution (lines 758–760):

Which::BaseV3_1B => "google/gemma-3-1b-pt".to_string(),
Which::InstructV3_1B => "google/gemma-3-1b-it".to_string(),

Model loading uses Model3 (Gemma 3) for Which::BaseV3_1B | Which::InstructV3_1B (lines 817–821).

During generation, the Gemma 3 attention path requires rotary embeddings. On the Metal backend in Candle 0.9.1, the rotary embedding op is not implemented, resulting in the runtime error.
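
For readers unfamiliar with the operation, the following is a minimal CPU reference sketch of what a rotary embedding computes for a single attention-head vector. It is not Candle's kernel. The function name rotary_embed, the "half-split" pairing of element i with element i + d/2, and the base parameter are assumptions made for illustration; the layout and frequency base used by the actual Gemma 3 implementation are not stated in this document.

```rust
/// Illustrative reference only (hypothetical helper, not Candle's API).
/// Applies a rotary position embedding to one head vector `x` at position `pos`.
/// Element `i` is paired with element `i + d/2`, and each pair is rotated by
/// the angle `pos * base^(-2i/d)`.
pub fn rotary_embed(x: &[f32], pos: usize, base: f32) -> Vec<f32> {
    let d = x.len();
    assert!(d % 2 == 0, "head dimension must be even");
    let half = d / 2;
    let mut out = vec![0.0f32; d];
    for i in 0..half {
        // Per-pair rotation frequency: lower indices rotate faster.
        let inv_freq = base.powf(-(2.0 * i as f32) / d as f32);
        let angle = pos as f32 * inv_freq;
        let (sin, cos) = angle.sin_cos();
        // Standard 2-D rotation of the pair (x[i], x[i + half]).
        out[i] = x[i] * cos - x[i + half] * sin;
        out[i + half] = x[i] * sin + x[i + half] * cos;
    }
    out
}
```

Because every pair is only rotated, the vector's length is preserved, and position 0 leaves the input unchanged. A backend that lacks this operation cannot run the attention layer at all, which is why the failure appears as soon as inference begins.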

Additional build-time signal (misleading but not causal)

File: crates/legacy-inference-engine/src/main.rs (lines 10–11)

#[cfg(feature = "metal")]
extern crate metal_src;

Build warning: unexpected cfg condition value: metal

Explanation: The project does not define a Cargo feature named "metal"; instead, Metal is enabled via target-specific dependency features in Cargo.toml. This cfg gate is ineffective and triggers a warning. It does not cause the runtime failure; it just indicates confusing/obsolete gating.
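
As a small illustration of the difference, a gate keyed on the target OS matches the condition under which Cargo.toml actually enables Metal, whereas a gate keyed on an undeclared feature is never true. The helper name below is hypothetical and does not exist in the crate.

```rust
/// Hypothetical helper: true when this build targets macOS, which is the
/// condition under which the crate's Cargo.toml enables Candle's "metal"
/// feature. Unlike `cfg!(feature = "metal")`, this does not depend on a
/// Cargo feature that the crate never declares.
pub fn built_for_metal_target() -> bool {
    cfg!(target_os = "macos")
}
```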

Root cause

The program runs on the Candle Metal backend (MPS) due to device auto-selection on macOS.
The selected model (Gemma 3 1B Instruct) requires the rotary embedding operation in its attention mechanism.
Candle 0.9.1’s Metal backend lacks an implementation for the rotary-emb kernel. When the model executes on Metal, it attempts to invoke this operation and fails with: "no metal implementation for rotary-emb".

Evidence

Runtime log shows the failure immediately after model load when inference begins.
Code paths confirm: device defaults to Metal on macOS; default model is Gemma 3; Gemma 3 uses rotary embeddings.
The Candle version is pinned to 0.9.1, in which rotary-emb is not implemented for Metal.

Impact

Any attempt to run Gemma 3 (and possibly other models that rely on the same rotary embedding op) on the Metal backend with Candle 0.9.1 will fail at runtime on macOS.

Workarounds and remediation options

Immediate workarounds:

Run on CPU: add the --cpu flag to force the CPU backend.
- Example: cargo run -p legacy-inference-engine --release -- --cpu --prompt '...' --which 3-1b-it
Use a model variant that does not hit the unimplemented kernel on Metal (e.g., older Gemma v1/v2), though many modern LLMs rely on rotary embeddings, so this may not help.

Recommended remediation (code/dependency changes):

Upgrade Candle crates (candle-core, candle-transformers, etc.) to a version where the Metal backend implements rotary embeddings. Review Candle’s changelog/PRs for Metal/MPS kernel support and update to the first version that includes rotary-emb on Metal.
Alternatively, implement a CPU fallback path for rotary-emb when running on Metal (hybrid execution). This is non-trivial and may degrade performance.
Provide a configuration/flag to disable Metal by default on macOS for models known to require missing ops until Candle is upgraded.
Clean up the misleading #[cfg(feature = "metal")] gate in main.rs to avoid confusion; Metal enablement is already handled in Cargo.toml via target-specific features.
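
The option of disabling Metal by default for affected models could look roughly like the sketch below. It uses stand-in types rather than Candle's Device so that it stays self-contained; Backend, ModelFamily, Availability, choose_backend, and the allow_metal parameter are all hypothetical names, and the list of affected models is an assumption limited to what this analysis observed (Gemma 3).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend { Cpu, Cuda, Metal }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelFamily { Gemma3, Other }

/// What the host reports as available (stand-in for runtime probes).
#[derive(Debug, Clone, Copy)]
pub struct Availability { pub cuda: bool, pub metal: bool }

/// Assumption: only Gemma 3 is known to hit the missing Metal kernel.
fn needs_metal_opt_out(model: ModelFamily) -> bool {
    matches!(model, ModelFamily::Gemma3)
}

/// Same priority order as the existing device selection (CPU flag, CUDA,
/// Metal, CPU), except that affected models skip Metal unless the caller
/// explicitly opts in via `allow_metal`.
pub fn choose_backend(
    force_cpu: bool,
    allow_metal: bool,
    model: ModelFamily,
    avail: Availability,
) -> Backend {
    if force_cpu {
        return Backend::Cpu;
    }
    if avail.cuda {
        return Backend::Cuda;
    }
    if avail.metal {
        if needs_metal_opt_out(model) && !allow_metal {
            eprintln!("Metal lacks a kernel required by {:?}; using CPU", model);
            return Backend::Cpu;
        }
        return Backend::Metal;
    }
    Backend::Cpu
}
```

Keeping the opt-out list in one function makes it easy to delete once the Candle upgrade lands.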

Suggested next steps

Short term: document the --cpu flag in the README and/or make the default model a Metal-compatible one until the Candle dependencies are upgraded.
Medium term: bump the Candle dependencies and test Gemma 3 on Metal; remove the obsolete cfg(feature = "metal") gate.
Long term: integrate a device capability check that automatically falls back to a supported backend, with an informative log message, when the selected backend lacks a required kernel.
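
A minimal sketch of the long-term fallback idea follows. It models the inference call as a closure and errors as plain strings, which is a simplification: the real Candle error type is not modeled here, and detecting the condition by matching the message text "no metal implementation for" is an assumption based solely on the error shown in this report. RunDevice, is_missing_kernel, and run_with_cpu_fallback are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunDevice { Cpu, Metal }

/// Assumption: a missing Metal kernel is recognizable from the error text
/// quoted in this report.
fn is_missing_kernel(msg: &str) -> bool {
    msg.contains("no metal implementation for")
}

/// Runs `run` on `primary`. If that fails with a missing-kernel error and the
/// primary device is not already the CPU, logs a warning and retries once on
/// the CPU. Any other error is returned unchanged. On success, also returns
/// the device that actually produced the result.
pub fn run_with_cpu_fallback<T, F>(primary: RunDevice, mut run: F) -> Result<(T, RunDevice), String>
where
    F: FnMut(RunDevice) -> Result<T, String>,
{
    match run(primary) {
        Ok(value) => Ok((value, primary)),
        Err(msg) if primary != RunDevice::Cpu && is_missing_kernel(&msg) => {
            eprintln!("warning: {msg}; retrying on CPU");
            run(RunDevice::Cpu).map(|value| (value, RunDevice::Cpu))
        }
        Err(msg) => Err(msg),
    }
}
```

Retrying only on the specific missing-kernel condition keeps unrelated failures (for example, out-of-memory errors) visible instead of silently masking them with a slow CPU run.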

References (code locations)

crates/legacy-inference-engine/src/utilities_lib.rs lines 4–12: device selection (Metal default on macOS if available).
crates/legacy-inference-engine/src/main.rs lines 705–707: default which = 3-1b-it.
crates/legacy-inference-engine/src/main.rs lines 758–760 and 817–821: Gemma 3 model selection and instantiation.
crates/legacy-inference-engine/Cargo.toml macOS target section: Candle with features = ["metal"].
crates/legacy-inference-engine/src/main.rs lines 10–11: obsolete #[cfg(feature = "metal")] gate that triggers a warning.
