Remove ROOT_CAUSE_ANALYSIS.md and outdated server logs

Author: geoffsee
Date: 2025-08-28 08:26:18 -04:00
Parent: b606adbe5d
Commit: c8b3561e36
11 changed files with 220 additions and 547 deletions

View File

@@ -5,3 +5,5 @@ target/
/*.iml
dist
node_modules/
prompt.md
todo

.gitignore (vendored, 2 lines changed)
View File

@@ -2,7 +2,7 @@
.fastembed_cache/
target/
/.output.txt
j**i?/
.j**i?/
*.iml
dist
node_modules/

View File

@@ -1,157 +0,0 @@
# Root Cause Analysis: Token Repetition in Streaming Text Generation
**Date:** August 27, 2025
**System:** Predict-Otron-9000 Inference Engine
**Issue:** Token repetition in streaming text generation despite successful individual token streaming implementation
## Executive Summary
The Predict-Otron-9000 system has successfully implemented individual token streaming and resolved false positive stream issues across multiple CLI invocations. However, token repetition remains a critical issue that degrades output quality. This analysis identifies the root cause as insufficient context preservation in the incremental token generation process, particularly for the Gemma model variants.
## Technical Background
### System Architecture
The Predict-Otron-9000 consists of several key components:
1. **Inference Engine** (`crates/inference-engine/`): Core text generation logic
- `TokenOutputStream`: Handles token-by-token decoding and streaming
- `TextGeneration`: Main generation logic with streaming support
- `server.rs`: HTTP API with Server-Sent Events (SSE) streaming
2. **CLI Client** (`cli.ts`): TypeScript client for interacting with the inference engine
3. **Model Support**: Gemma-1, Gemma-2, and Gemma-3 model variants
### Streaming Implementation Changes
#### Individual Token Generation ✅ RESOLVED
**Previous Behavior:** Tokens were generated in batches and sent all at once.
**Current Implementation:**
- `TokenOutputStream.next_token()` processes individual tokens with incremental decoding (the idea is sketched after this list)
- Modified to "include all tokens, not just alphanumeric ones" (token_output_stream.rs:44)
- Server streams tokens via SSE using callback mechanism in `TextGeneration.run_with_streaming()`
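The incremental-decoding idea behind `TokenOutputStream` can be illustrated with a minimal sketch. The names and structure below are illustrative only, not the crate's actual implementation, and assume the `tokenizers` crate:
```rust
use tokenizers::Tokenizer;

/// Hypothetical minimal decoder (illustrative, not the real `TokenOutputStream`):
/// keep every generated token id and emit only the newly decodable text suffix.
struct StreamDecoder {
    tokenizer: Tokenizer,
    tokens: Vec<u32>,
    emitted_len: usize,
}

impl StreamDecoder {
    fn new(tokenizer: Tokenizer) -> Self {
        Self { tokenizer, tokens: Vec::new(), emitted_len: 0 }
    }

    /// Push one token id; return the text fragment to stream, if any.
    fn next_token(
        &mut self,
        token: u32,
    ) -> Result<Option<String>, Box<dyn std::error::Error + Send + Sync>> {
        self.tokens.push(token);
        // Re-decode the whole sequence (simple but O(n) per token) and emit
        // whatever text appeared beyond what was already sent.
        let text = self.tokenizer.decode(&self.tokens, true)?;
        if text.len() > self.emitted_len && text.is_char_boundary(self.emitted_len) {
            let delta = text[self.emitted_len..].to_string();
            self.emitted_len = text.len();
            Ok(Some(delta)) // the SSE callback would forward this chunk
        } else {
            Ok(None) // incomplete multi-byte/multi-token piece; wait for more tokens
        }
    }
}
```
Re-decoding the full sequence on every step keeps emitted text consistent across multi-token characters, at the cost of linear work per token.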
#### CLI Multiple Invocation Support ✅ RESOLVED
**Previous Issue:** Multiple CLI invocations received false positive streams from previous sessions.
**Current Solution:**
- Each CLI invocation creates a fresh OpenAI client connection
- Server calls `text_gen.reset_state()` before each streaming request
- `TokenOutputStream.clear()` resets token buffers and indices
- Penalty cache is cleared for each new generation
## Root Cause Analysis: Token Repetition
### Primary Root Cause: Insufficient Context Window
The token repetition issue stems from **severe context limitation** in the incremental generation process:
#### 1. Gemma Model Special Handling (Lines 694-806 in text_generation.rs)
```rust
// Use just the last token for subsequent iterations to avoid shape mismatch
let context_tokens = &tokens[(tokens.len()-1)..];
let start_pos = tokens.len() - 1;
```
**Problem:** For Gemma-2 and Gemma-3 models, only the **last single token** is used for subsequent forward passes. This eliminates virtually all context, forcing the model to generate based on minimal information.
#### 2. Standard Model Handling (Lines 808-850 in text_generation.rs)
```rust
let context_size = if index > 0 { 1 } else { tokens.len() };
let start_pos = tokens.len().saturating_sub(context_size);
let ctxt = &tokens[start_pos..];
```
**Problem:** After the first token, context is limited to just **1 token** for all subsequent generations, again severely restricting the model's ability to maintain coherent context.
#### 3. Penalty Cache Clearing
```rust
// Clear penalty cache for new generation
self.penalty_cache.clear();
```
**Contributing Factor:** The repeat penalty cache is cleared at the start of each streaming generation, reducing the effectiveness of repetition prevention mechanisms.
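For reference, a repeat penalty can only act on the context it is given. A minimal sketch of the standard formulation (plain Rust over raw logits; this is not the engine's actual `apply_cached_repeat_penalty`) makes the dependence on window size explicit:
```rust
/// Sketch: penalize the logits of tokens that already appear in the last
/// `window` generated tokens. With window = 1, only the immediately
/// preceding token is penalized, so longer repetition loops go unchecked.
fn apply_repeat_penalty(logits: &mut [f32], generated: &[u32], penalty: f32, window: usize) {
    let start = generated.len().saturating_sub(window);
    for &tok in &generated[start..] {
        if let Some(logit) = logits.get_mut(tok as usize) {
            // Standard formulation: push positive logits down, negative ones further down.
            *logit = if *logit >= 0.0 { *logit / penalty } else { *logit * penalty };
        }
    }
}
```
With `window` spanning the full generation, a loop such as the repeated ' truly' tokens in the server logs keeps accumulating penalty; with a one-token window and a freshly cleared cache, only the immediately preceding token is discouraged.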
### Secondary Contributing Factors
1. **Shape Compatibility Workaround**: The single-token context approach was implemented to "avoid shape mismatch" in Gemma models, prioritizing technical compatibility over generation quality.
2. **Incremental Context Loss**: Each forward pass operates with minimal historical context, making it impossible for the model to understand what it has already generated.
3. **Inadequate Repeat Penalty Context**: The repeat penalty mechanism (`apply_cached_repeat_penalty`) has limited effectiveness when working with truncated context windows.
## Impact Analysis
### Performance Impact
- **Positive**: Individual token streaming provides responsive user experience
- **Positive**: CLI multiple invocations work correctly without interference
- **Negative**: Poor output quality due to repetitive content
### User Experience Impact
- **Critical**: Generated text contains significant repetition, reducing practical utility
- **Positive**: Real-time streaming provides immediate feedback
- **Positive**: Consistent behavior across multiple CLI sessions
### Technical Debt
- **High**: Current implementation prioritizes technical workarounds over generation quality
- **Medium**: Context limitation approach creates maintenance burden
- **Low**: Streaming infrastructure is well-architected and maintainable
## Timeline and Change History
Based on code analysis, the following changes were implemented:
1. **Token Streaming Enhancement**: Modified `TokenOutputStream` to include all tokens, not just alphanumeric
2. **Individual Token Callbacks**: Implemented streaming callbacks in `TextGeneration.run_with_streaming()`
3. **CLI State Management**: Added proper state reset and fresh connections
4. **Context Limitation Implementation**: Applied single-token context for incremental generation
5. **SSE Integration**: Implemented Server-Sent Events for real-time token delivery
## Recommendations for Future Iterations
### Immediate Priority (Critical)
1. **Implement Sliding Window Context**: Replace the single-token context with a configurable sliding window (e.g., the last 50-100 tokens); a sketch follows this list
2. **Context-Aware Repeat Penalty**: Maintain repeat penalty context across the full generation window
3. **Model-Specific Context Handling**: Develop proper context management for each Gemma variant without sacrificing context size
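As a rough illustration of the first recommendation, context selection could look like the sketch below. This is a hedged sketch rather than the engine's actual forward-pass code: `MAX_CONTEXT` is an assumed tunable, and the KV-cache/position bookkeeping that originally caused the Gemma shape mismatch would still need the model-specific handling called out in point 3.
```rust
/// Assumed tunable: how many trailing tokens to re-feed on each step.
const MAX_CONTEXT: usize = 64;

/// Sketch: choose the context slice for the next forward pass.
/// `index == 0` is the prompt pass; later passes re-feed a trailing window
/// instead of a single token, trading some compute for coherence.
fn context_for_step(tokens: &[u32], index: usize) -> (&[u32], usize) {
    let context_size = if index > 0 {
        MAX_CONTEXT.min(tokens.len())
    } else {
        tokens.len()
    };
    let start_pos = tokens.len() - context_size;
    (&tokens[start_pos..], start_pos)
}
```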
### Medium-Term Improvements
1. **Dynamic Context Sizing**: Implement adaptive context windows based on available memory and model capabilities
2. **Advanced Repetition Detection**: Implement semantic-level repetition detection beyond token-level penalties
3. **Context Compression**: Explore context compression techniques to maintain longer effective context windows
### Long-Term Enhancements
1. **Beam Search Integration**: Implement beam search with streaming for better output quality
2. **Adaptive Sampling**: Dynamic adjustment of sampling parameters based on repetition detection
3. **Model Fine-tuning**: Consider fine-tuning approaches to reduce repetition tendency at the model level
## Monitoring and Validation
### Key Metrics to Track
1. **Repetition Rate**: Measure token- and n-gram-level repetition frequency (see the sketch after this list)
2. **Context Utilization**: Monitor effective context window usage
3. **Generation Quality**: Track coherence and diversity metrics
4. **Streaming Performance**: Maintain current responsiveness standards
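As a possible starting point for the first metric, an n-gram repetition rate over generated token ids (a sketch; the choice of `n` and any alert threshold are assumptions):
```rust
use std::collections::HashMap;

/// Sketch: fraction of n-grams in `tokens` that repeat an n-gram seen earlier
/// in the same output. 0.0 = no repetition, 1.0 = fully repeated.
fn ngram_repetition_rate(tokens: &[u32], n: usize) -> f64 {
    if n == 0 || tokens.len() < n {
        return 0.0;
    }
    let mut seen: HashMap<&[u32], usize> = HashMap::new();
    let mut repeats = 0usize;
    let total = tokens.len() - n + 1;
    for window in tokens.windows(n) {
        let count = seen.entry(window).or_insert(0);
        if *count > 0 {
            repeats += 1;
        }
        *count += 1;
    }
    repeats as f64 / total as f64
}
```
A degenerate run of a single repeated token drives the 3-gram rate toward 1.0, while diverse text stays near 0.0.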
### Testing Strategy
1. **Repetition Benchmarks**: Create standardized tests for repetition detection
2. **Context Window Testing**: Validate context preservation across different window sizes
3. **Model Variant Testing**: Ensure consistent behavior across Gemma-1, Gemma-2, and Gemma-3
4. **Regression Testing**: Maintain streaming functionality during context improvements
## Conclusion
The Predict-Otron-9000 has successfully achieved individual token streaming and eliminated false positive streams in CLI usage. However, the current implementation's approach to context management—using only single tokens for incremental generation—is the primary root cause of token repetition issues.
The solution requires balancing technical compatibility with generation quality by implementing proper sliding window context management while maintaining the current streaming performance and reliability. This represents a critical technical debt that should be addressed in the next development iteration to realize the system's full potential.
**Priority Level:** Critical
**Complexity:** Medium
**Risk Level:** Low (improvements can be made incrementally)
**User Impact:** High (significant quality improvement expected)

View File

@@ -0,0 +1,42 @@
# ---- Build stage ----
FROM rust:1-slim-bullseye AS builder
WORKDIR /usr/src/app
# Install build dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
pkg-config \
libssl-dev \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Cache deps first
COPY . ./
RUN rm -rf src
RUN mkdir src && echo "fn main() {}" > src/main.rs && echo "// lib" > src/lib.rs && cargo build --release
RUN rm -rf src
# Copy real sources and build
COPY . .
RUN cargo build --release
# ---- Runtime stage ----
FROM debian:bullseye-slim
# Install only what the compiled binary needs
RUN apt-get update && \
apt-get install -y --no-install-recommends \
libssl1.1 \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Copy binary from builder
COPY --from=builder /usr/src/app/target/release/embeddings-engine /usr/local/bin/
# Run as non-root user for safety
RUN useradd -m appuser
USER appuser
EXPOSE 8080
CMD ["embeddings-engine"]

View File

@@ -25,10 +25,6 @@ static EMBEDDING_MODEL: Lazy<TextEmbedding> = Lazy::new(|| {
model
});
pub async fn root() -> &'static str {
"Hello, World!"
}
pub async fn embeddings_create(
Json(payload): Json<CreateEmbeddingRequest>,
) -> ResponseJson<serde_json::Value> {

View File

@@ -13,9 +13,6 @@ use tracing;
const DEFAULT_SERVER_HOST: &str = "127.0.0.1";
const DEFAULT_SERVER_PORT: &str = "8080";
async fn root() -> &'static str {
"Hello, World!"
}
async fn embeddings_create(
Json(payload): Json<CreateEmbeddingRequest>,
@@ -162,24 +159,6 @@ mod tests {
use axum::http::StatusCode;
use tower::ServiceExt;
#[tokio::test]
async fn test_root() {
let app = create_app();
let response = app
.oneshot(
axum::http::Request::builder()
.uri("/")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(response.status(), StatusCode::OK);
let body = to_bytes(response.into_body(), usize::MAX).await.unwrap();
assert_eq!(&body[..], b"Hello, World!");
}
#[tokio::test]
async fn test_embeddings_create() {
// Start a test server

View File

@@ -0,0 +1,86 @@
# ---- Build stage ----
FROM rust:1-slim-bullseye AS builder
WORKDIR /usr/src/app
# Install build dependencies including CUDA toolkit for GPU support
RUN apt-get update && \
apt-get install -y --no-install-recommends \
pkg-config \
libssl-dev \
build-essential \
wget \
gnupg2 \
curl \
&& rm -rf /var/lib/apt/lists/*
# Install CUDA toolkit (optional, for GPU support)
# This is a minimal CUDA installation for building
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb && \
dpkg -i cuda-keyring_1.0-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-minimal-build-11-8 \
libcublas-dev-11-8 \
libcurand-dev-11-8 \
&& rm -rf /var/lib/apt/lists/* \
&& rm cuda-keyring_1.0-1_all.deb
# Set CUDA environment variables
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=${CUDA_HOME}/bin:${PATH}
ENV LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}
# Copy the entire workspace to get access to all crates
COPY . ./
# Cache dependencies first - create dummy source files
RUN rm -rf crates/inference-engine/src
RUN mkdir -p crates/inference-engine/src && \
echo "fn main() {}" > crates/inference-engine/src/main.rs && \
echo "fn main() {}" > crates/inference-engine/src/cli_main.rs && \
echo "// lib" > crates/inference-engine/src/lib.rs && \
cargo build --release --bin cli --package inference-engine
# Remove dummy source and copy real sources
RUN rm -rf crates/inference-engine/src
COPY . .
# Build the actual CLI binary
RUN cargo build --release --bin cli --package inference-engine
# ---- Runtime stage ----
FROM debian:bullseye-slim
# Install runtime dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
libssl1.1 \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Install CUDA runtime libraries (optional, for GPU support at runtime)
RUN apt-get update && \
apt-get install -y --no-install-recommends \
wget \
gnupg2 \
&& wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb \
&& dpkg -i cuda-keyring_1.0-1_all.deb \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
cuda-cudart-11-8 \
libcublas11 \
libcurand10 \
&& rm -rf /var/lib/apt/lists/* \
&& rm cuda-keyring_1.0-1_all.deb \
&& apt-get purge -y wget gnupg2
# Copy binary from builder
COPY --from=builder /usr/src/app/target/release/cli /usr/local/bin/inference-cli
# Run as non-root user for safety
RUN useradd -m appuser
USER appuser
EXPOSE 8080
CMD ["inference-cli"]

View File

@@ -0,0 +1,89 @@
# ---- Build stage ----
FROM rust:1-slim-bullseye AS builder
WORKDIR /usr/src/app
# Install build dependencies including CUDA toolkit for GPU support (needed for inference-engine dependency)
RUN apt-get update && \
apt-get install -y --no-install-recommends \
pkg-config \
libssl-dev \
build-essential \
wget \
gnupg2 \
curl \
&& rm -rf /var/lib/apt/lists/*
# Install CUDA toolkit (required for inference-engine dependency)
# This is a minimal CUDA installation for building
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb && \
dpkg -i cuda-keyring_1.0-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-minimal-build-11-8 \
libcublas-dev-11-8 \
libcurand-dev-11-8 \
&& rm -rf /var/lib/apt/lists/* \
&& rm cuda-keyring_1.0-1_all.deb
# Set CUDA environment variables
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=${CUDA_HOME}/bin:${PATH}
ENV LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}
# Copy the entire workspace to get access to all crates (needed for local dependencies)
COPY . ./
# Cache dependencies first - create dummy source files for all crates
RUN rm -rf crates/predict-otron-9000/src crates/inference-engine/src crates/embeddings-engine/src
RUN mkdir -p crates/predict-otron-9000/src crates/inference-engine/src crates/embeddings-engine/src && \
echo "fn main() {}" > crates/predict-otron-9000/src/main.rs && \
echo "fn main() {}" > crates/inference-engine/src/main.rs && \
echo "fn main() {}" > crates/inference-engine/src/cli_main.rs && \
echo "// lib" > crates/inference-engine/src/lib.rs && \
echo "fn main() {}" > crates/embeddings-engine/src/main.rs && \
echo "// lib" > crates/embeddings-engine/src/lib.rs && \
cargo build --release --bin predict-otron-9000 --package predict-otron-9000
# Remove dummy sources and copy real sources
RUN rm -rf crates/predict-otron-9000/src crates/inference-engine/src crates/embeddings-engine/src
COPY . .
# Build the actual binary
RUN cargo build --release --bin predict-otron-9000 --package predict-otron-9000
# ---- Runtime stage ----
FROM debian:bullseye-slim
# Install runtime dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
libssl1.1 \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Install CUDA runtime libraries (required for inference-engine dependency)
RUN apt-get update && \
apt-get install -y --no-install-recommends \
wget \
gnupg2 \
&& wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb \
&& dpkg -i cuda-keyring_1.0-1_all.deb \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
cuda-cudart-11-8 \
libcublas11 \
libcurand10 \
&& rm -rf /var/lib/apt/lists/* \
&& rm cuda-keyring_1.0-1_all.deb \
&& apt-get purge -y wget gnupg2
# Copy binary from builder
COPY --from=builder /usr/src/app/target/release/predict-otron-9000 /usr/local/bin/
# Run as non-root user for safety
RUN useradd -m appuser
USER appuser
EXPOSE 8080
CMD ["predict-otron-9000"]

View File

@@ -1,48 +0,0 @@
Compiling inference-engine v0.1.0 (/Users/williamseemueller/workspace/seemueller-io/predict-otron-9000/crates/inference-engine)
warning: unused import: `Config as Config1`
--> crates/inference-engine/src/model.rs:2:42
|
2 | use candle_transformers::models::gemma::{Config as Config1, Model as Model1};
| ^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused import: `Config as Config2`
--> crates/inference-engine/src/model.rs:3:43
|
3 | use candle_transformers::models::gemma2::{Config as Config2, Model as Model2};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `Config as Config3`
--> crates/inference-engine/src/model.rs:4:43
|
4 | use candle_transformers::models::gemma3::{Config as Config3, Model as Model3};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `self`
--> crates/inference-engine/src/server.rs:10:28
|
10 | use futures_util::stream::{self, Stream};
| ^^^^
warning: `inference-engine` (lib) generated 4 warnings (run `cargo fix --lib -p inference-engine` to apply 4 suggestions)
Compiling predict-otron-9000 v0.1.0 (/Users/williamseemueller/workspace/seemueller-io/predict-otron-9000/crates/predict-otron-9000)
Finished `release` profile [optimized] target(s) in 4.01s
Running `target/release/predict-otron-9000`
2025-08-28T01:43:11.512475Z  INFO predict_otron_9000::middleware::metrics: Performance metrics summary:
avx: false, neon: true, simd128: false, f16c: false
2025-08-28T01:43:11.512811Z  INFO hf_hub: Using token file found "/Users/williamseemueller/.cache/huggingface/token"
retrieved the files in 685.958µs
2025-08-28T01:43:12.661378Z  INFO predict_otron_9000: Unified predict-otron-9000 server listening on 127.0.0.1:8080
2025-08-28T01:43:12.661400Z  INFO predict_otron_9000: Performance metrics tracking enabled - summary logs every 60 seconds
2025-08-28T01:43:12.661403Z  INFO predict_otron_9000: Available endpoints:
2025-08-28T01:43:12.661405Z  INFO predict_otron_9000: GET / - Root endpoint from embeddings-engine
2025-08-28T01:43:12.661407Z  INFO predict_otron_9000: POST /v1/embeddings - Text embeddings
2025-08-28T01:43:12.661409Z  INFO predict_otron_9000: POST /v1/chat/completions - Chat completions
2025-08-28T01:43:19.166677Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 1)
2025-08-28T01:43:19.296257Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 2)
2025-08-28T01:43:19.424883Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 3)
2025-08-28T01:43:19.554508Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 4)
2025-08-28T01:43:19.683153Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 5)
2025-08-28T01:43:19.683181Z  INFO inference_engine::server: Stopping generation due to excessive repetition
2025-08-28T01:43:19.683221Z  INFO inference_engine::server: Text generation stopped: Repetition detected - stopping generation

View File

@@ -1,277 +0,0 @@
warning: unused import: `Config as Config1`
--> crates/inference-engine/src/model.rs:2:42
|
2 | use candle_transformers::models::gemma::{Config as Config1, Model as Model1};
| ^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused import: `Config as Config2`
--> crates/inference-engine/src/model.rs:3:43
|
3 | use candle_transformers::models::gemma2::{Config as Config2, Model as Model2};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `Config as Config3`
--> crates/inference-engine/src/model.rs:4:43
|
4 | use candle_transformers::models::gemma3::{Config as Config3, Model as Model3};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `self`
--> crates/inference-engine/src/server.rs:10:28
|
10 | use futures_util::stream::{self, Stream};
| ^^^^
warning: `inference-engine` (lib) generated 4 warnings (run `cargo fix --lib -p inference-engine` to apply 4 suggestions)
Finished `release` profile [optimized] target(s) in 0.13s
Running `target/release/predict-otron-9000`
avx: false, neon: true, simd128: false, f16c: false
2025-08-28T00:34:39.293635Z  INFO hf_hub: Using token file found "/Users/williamseemueller/.cache/huggingface/token"
retrieved the files in 295.458µs
2025-08-28T00:34:39.294536Z  INFO predict_otron_9000::middleware::metrics: Performance metrics summary:
2025-08-28T00:34:40.507474Z  INFO predict_otron_9000: Unified predict-otron-9000 server listening on 127.0.0.1:8080
2025-08-28T00:34:40.507503Z  INFO predict_otron_9000: Performance metrics tracking enabled - summary logs every 60 seconds
2025-08-28T00:34:40.507508Z  INFO predict_otron_9000: Available endpoints:
2025-08-28T00:34:40.507512Z  INFO predict_otron_9000: GET / - Root endpoint from embeddings-engine
2025-08-28T00:34:40.507515Z  INFO predict_otron_9000: POST /v1/embeddings - Text embeddings
2025-08-28T00:34:40.507517Z  INFO predict_otron_9000: POST /v1/chat/completions - Chat completions
2025-08-28T00:34:52.313606Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2025-08-28T00:34:52.313671Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: inference_engine::server: Formatted prompt: <start_of_turn>user
You are a helpful assistant who responds thoughtfully and concisely.
Write a paragraph about dogs<end_of_turn>
<start_of_turn>model
2025-08-28T00:34:52.313693Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: predict_otron_9000::middleware::metrics: POST /v1/chat/completions 200 OK - 0 ms
2025-08-28T00:34:52.313709Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=0 ms status=200
2025-08-28T00:34:52.313763Z DEBUG inference_engine::text_generation: Cleared penalty cache for new generation (streaming mode)
2025-08-28T00:34:52.313985Z DEBUG inference_engine::text_generation: Streaming Tokenization completed in 217.04µs
2025-08-28T00:34:52.313990Z DEBUG inference_engine::text_generation: Streaming Input tokens: 26
2025-08-28T00:34:52.340937Z DEBUG inference_engine::text_generation: Using special generation approach for gemma-2/gemma-3 models (streaming)
2025-08-28T00:34:52.602691Z DEBUG inference_engine::server: Streaming token: 'Dogs'
2025-08-28T00:34:52.602718Z DEBUG inference_engine::server: Sending chunk with content: 'Dogs'
2025-08-28T00:34:52.769918Z DEBUG inference_engine::server: Streaming token: ' have'
2025-08-28T00:34:52.769949Z DEBUG inference_engine::server: Sending chunk with content: ' have'
2025-08-28T00:34:52.905947Z DEBUG inference_engine::server: Streaming token: ' captivated'
2025-08-28T00:34:52.905977Z DEBUG inference_engine::server: Sending chunk with content: ' captivated'
2025-08-28T00:34:53.040888Z DEBUG inference_engine::server: Streaming token: ' humans'
2025-08-28T00:34:53.040921Z DEBUG inference_engine::server: Sending chunk with content: ' humans'
2025-08-28T00:34:53.177116Z DEBUG inference_engine::server: Streaming token: ' for'
2025-08-28T00:34:53.177145Z DEBUG inference_engine::server: Sending chunk with content: ' for'
2025-08-28T00:34:53.313887Z DEBUG inference_engine::server: Streaming token: ' millennia'
2025-08-28T00:34:53.313920Z DEBUG inference_engine::server: Sending chunk with content: ' millennia'
2025-08-28T00:34:53.444031Z DEBUG inference_engine::server: Streaming token: ','
2025-08-28T00:34:53.444060Z DEBUG inference_engine::server: Sending chunk with content: ','
2025-08-28T00:34:53.571919Z DEBUG inference_engine::server: Streaming token: ' evolving'
2025-08-28T00:34:53.571951Z DEBUG inference_engine::server: Sending chunk with content: ' evolving'
2025-08-28T00:34:53.699811Z DEBUG inference_engine::server: Streaming token: ' from'
2025-08-28T00:34:53.699852Z DEBUG inference_engine::server: Sending chunk with content: ' from'
2025-08-28T00:34:53.828082Z DEBUG inference_engine::server: Streaming token: ' wolves'
2025-08-28T00:34:53.828111Z DEBUG inference_engine::server: Sending chunk with content: ' wolves'
2025-08-28T00:34:53.957276Z DEBUG inference_engine::server: Streaming token: ' to'
2025-08-28T00:34:53.957313Z DEBUG inference_engine::server: Sending chunk with content: ' to'
2025-08-28T00:34:54.093248Z DEBUG inference_engine::server: Streaming token: ' beloved'
2025-08-28T00:34:54.093284Z DEBUG inference_engine::server: Sending chunk with content: ' beloved'
2025-08-28T00:34:54.228357Z DEBUG inference_engine::server: Streaming token: ' companions'
2025-08-28T00:34:54.228385Z DEBUG inference_engine::server: Sending chunk with content: ' companions'
2025-08-28T00:34:54.356315Z DEBUG inference_engine::server: Streaming token: ' offering'
2025-08-28T00:34:54.356349Z DEBUG inference_engine::server: Sending chunk with content: ' offering'
2025-08-28T00:34:54.484051Z DEBUG inference_engine::server: Streaming token: ' unwavering'
2025-08-28T00:34:54.484085Z DEBUG inference_engine::server: Sending chunk with content: ' unwavering'
2025-08-28T00:34:54.613022Z DEBUG inference_engine::server: Streaming token: ' loyalty'
2025-08-28T00:34:54.613061Z DEBUG inference_engine::server: Sending chunk with content: ' loyalty'
2025-08-28T00:34:54.742024Z DEBUG inference_engine::server: Streaming token: ' alongside'
2025-08-28T00:34:54.742043Z DEBUG inference_engine::server: Sending chunk with content: ' alongside'
2025-08-28T00:34:54.869804Z DEBUG inference_engine::server: Streaming token: ' boundless'
2025-08-28T00:34:54.869829Z DEBUG inference_engine::server: Sending chunk with content: ' boundless'
2025-08-28T00:34:54.998140Z DEBUG inference_engine::server: Streaming token: ' affection'
2025-08-28T00:34:54.998165Z DEBUG inference_engine::server: Sending chunk with content: ' affection'
2025-08-28T00:34:55.126560Z DEBUG inference_engine::server: Streaming token: ' '
2025-08-28T00:34:55.126582Z DEBUG inference_engine::server: Sending chunk with content: ' '
2025-08-28T00:34:55.255214Z DEBUG inference_engine::server: Streaming token: ' often'
2025-08-28T00:34:55.255232Z DEBUG inference_engine::server: Sending chunk with content: ' often'
2025-08-28T00:34:55.383529Z DEBUG inference_engine::server: Streaming token: ' fueled'
2025-08-28T00:34:55.383551Z DEBUG inference_engine::server: Sending chunk with content: ' fueled'
2025-08-28T00:34:55.511437Z DEBUG inference_engine::server: Streaming token: ' by'
2025-08-28T00:34:55.511456Z DEBUG inference_engine::server: Sending chunk with content: ' by'
2025-08-28T00:34:55.639748Z DEBUG inference_engine::server: Streaming token: ' their'
2025-08-28T00:34:55.639768Z DEBUG inference_engine::server: Sending chunk with content: ' their'
2025-08-28T00:34:55.767723Z DEBUG inference_engine::server: Streaming token: ' incredible'
2025-08-28T00:34:55.767741Z DEBUG inference_engine::server: Sending chunk with content: ' incredible'
2025-08-28T00:34:55.895796Z DEBUG inference_engine::server: Streaming token: ' ability'
2025-08-28T00:34:55.895817Z DEBUG inference_engine::server: Sending chunk with content: ' ability'
2025-08-28T00:34:56.025191Z DEBUG inference_engine::server: Streaming token: ' at'
2025-08-28T00:34:56.025219Z DEBUG inference_engine::server: Sending chunk with content: ' at'
2025-08-28T00:34:56.153604Z DEBUG inference_engine::server: Streaming token: ' understanding'
2025-08-28T00:34:56.153626Z DEBUG inference_engine::server: Sending chunk with content: ' understanding'
2025-08-28T00:34:56.282571Z DEBUG inference_engine::server: Streaming token: ' human'
2025-08-28T00:34:56.282590Z DEBUG inference_engine::server: Sending chunk with content: ' human'
2025-08-28T00:34:56.411224Z DEBUG inference_engine::server: Streaming token: ' emotion'
2025-08-28T00:34:56.411247Z DEBUG inference_engine::server: Sending chunk with content: ' emotion'
2025-08-28T00:34:56.540028Z DEBUG inference_engine::server: Streaming token: ' through'
2025-08-28T00:34:56.540050Z DEBUG inference_engine::server: Sending chunk with content: ' through'
2025-08-28T00:34:56.668612Z DEBUG inference_engine::server: Streaming token: ' subtle'
2025-08-28T00:34:56.668630Z DEBUG inference_engine::server: Sending chunk with content: ' subtle'
2025-08-28T00:34:56.797698Z DEBUG inference_engine::server: Streaming token: ' cues'
2025-08-28T00:34:56.797716Z DEBUG inference_engine::server: Sending chunk with content: ' cues'
2025-08-28T00:34:56.927032Z DEBUG inference_engine::server: Streaming token: '!'
2025-08-28T00:34:56.927054Z DEBUG inference_engine::server: Sending chunk with content: '!'
2025-08-28T00:34:57.054903Z DEBUG inference_engine::server: Streaming token: ' Beyond'
2025-08-28T00:34:57.054922Z DEBUG inference_engine::server: Sending chunk with content: ' Beyond'
2025-08-28T00:34:57.183890Z DEBUG inference_engine::server: Streaming token: ' companionship'
2025-08-28T00:34:57.183914Z DEBUG inference_engine::server: Sending chunk with content: ' companionship'
2025-08-28T00:34:57.313258Z DEBUG inference_engine::server: Streaming token: ' they'
2025-08-28T00:34:57.313278Z DEBUG inference_engine::server: Sending chunk with content: ' they'
2025-08-28T00:34:57.441875Z DEBUG inference_engine::server: Streaming token: ' provide'
2025-08-28T00:34:57.441897Z DEBUG inference_engine::server: Sending chunk with content: ' provide'
2025-08-28T00:34:57.569839Z DEBUG inference_engine::server: Streaming token: ' crucial'
2025-08-28T00:34:57.569864Z DEBUG inference_engine::server: Sending chunk with content: ' crucial'
2025-08-28T00:34:57.700161Z DEBUG inference_engine::server: Streaming token: ' assistance'
2025-08-28T00:34:57.700184Z DEBUG inference_engine::server: Sending chunk with content: ' assistance'
2025-08-28T00:34:57.828427Z DEBUG inference_engine::server: Streaming token: ' with'
2025-08-28T00:34:57.828453Z DEBUG inference_engine::server: Sending chunk with content: ' with'
2025-08-28T00:34:57.957703Z DEBUG inference_engine::server: Streaming token: ' tasks'
2025-08-28T00:34:57.957727Z DEBUG inference_engine::server: Sending chunk with content: ' tasks'
2025-08-28T00:34:58.085556Z DEBUG inference_engine::server: Streaming token: ' like'
2025-08-28T00:34:58.085579Z DEBUG inference_engine::server: Sending chunk with content: ' like'
2025-08-28T00:34:58.213727Z DEBUG inference_engine::server: Streaming token: ' guarding'
2025-08-28T00:34:58.213750Z DEBUG inference_engine::server: Sending chunk with content: ' guarding'
2025-08-28T00:34:58.342674Z DEBUG inference_engine::server: Streaming token: ' property'
2025-08-28T00:34:58.342696Z DEBUG inference_engine::server: Sending chunk with content: ' property'
2025-08-28T00:34:58.474992Z DEBUG inference_engine::server: Streaming token: ' or'
2025-08-28T00:34:58.475011Z DEBUG inference_engine::server: Sending chunk with content: ' or'
2025-08-28T00:34:58.603613Z DEBUG inference_engine::server: Streaming token: ' assisting'
2025-08-28T00:34:58.603636Z DEBUG inference_engine::server: Sending chunk with content: ' assisting'
2025-08-28T00:34:58.732292Z DEBUG inference_engine::server: Streaming token: ' individuals'
2025-08-28T00:34:58.732316Z DEBUG inference_engine::server: Sending chunk with content: ' individuals'
2025-08-28T00:34:58.861810Z DEBUG inference_engine::server: Streaming token: ' who'
2025-08-28T00:34:58.861847Z DEBUG inference_engine::server: Sending chunk with content: ' who'
2025-08-28T00:34:58.989748Z DEBUG inference_engine::server: Streaming token: ' are'
2025-08-28T00:34:58.989765Z DEBUG inference_engine::server: Sending chunk with content: ' are'
2025-08-28T00:34:59.118088Z DEBUG inference_engine::server: Streaming token: ' blind'
2025-08-28T00:34:59.118105Z DEBUG inference_engine::server: Sending chunk with content: ' blind'
2025-08-28T00:34:59.246722Z DEBUG inference_engine::server: Streaming token: ' and'
2025-08-28T00:34:59.246746Z DEBUG inference_engine::server: Sending chunk with content: ' and'
2025-08-28T00:34:59.375090Z DEBUG inference_engine::server: Streaming token: ' deaf'
2025-08-28T00:34:59.375119Z DEBUG inference_engine::server: Sending chunk with content: ' deaf'
2025-08-28T00:34:59.503369Z DEBUG inference_engine::server: Streaming token: '.'
2025-08-28T00:34:59.503398Z DEBUG inference_engine::server: Sending chunk with content: '.'
2025-08-28T00:34:59.632352Z DEBUG inference_engine::server: Streaming token: ' Their'
2025-08-28T00:34:59.632374Z DEBUG inference_engine::server: Sending chunk with content: ' Their'
2025-08-28T00:34:59.760656Z DEBUG inference_engine::server: Streaming token: ' diverse'
2025-08-28T00:34:59.760675Z DEBUG inference_engine::server: Sending chunk with content: ' diverse'
2025-08-28T00:34:59.889274Z DEBUG inference_engine::server: Streaming token: ' breeds'
2025-08-28T00:34:59.889293Z DEBUG inference_engine::server: Sending chunk with content: ' breeds'
2025-08-28T00:35:00.018013Z DEBUG inference_engine::server: Streaming token: ' reflect'
2025-08-28T00:35:00.018043Z DEBUG inference_engine::server: Sending chunk with content: ' reflect'
2025-08-28T00:35:00.146874Z DEBUG inference_engine::server: Streaming token: ' a'
2025-08-28T00:35:00.146903Z DEBUG inference_engine::server: Sending chunk with content: ' a'
2025-08-28T00:35:00.275232Z DEBUG inference_engine::server: Streaming token: ' fascinating'
2025-08-28T00:35:00.275257Z DEBUG inference_engine::server: Sending chunk with content: ' fascinating'
2025-08-28T00:35:00.403452Z DEBUG inference_engine::server: Streaming token: ' range'
2025-08-28T00:35:00.403472Z DEBUG inference_engine::server: Sending chunk with content: ' range'
2025-08-28T00:35:00.535110Z DEBUG inference_engine::server: Streaming token: ' of'
2025-08-28T00:35:00.535133Z DEBUG inference_engine::server: Sending chunk with content: ' of'
2025-08-28T00:35:00.663383Z DEBUG inference_engine::server: Streaming token: ' personalities'
2025-08-28T00:35:00.663402Z DEBUG inference_engine::server: Sending chunk with content: ' personalities'
2025-08-28T00:35:00.792808Z DEBUG inference_engine::server: Streaming token: ' shaped'
2025-08-28T00:35:00.792836Z DEBUG inference_engine::server: Sending chunk with content: ' shaped'
2025-08-28T00:35:00.921350Z DEBUG inference_engine::server: Streaming token: ' over'
2025-08-28T00:35:00.921378Z DEBUG inference_engine::server: Sending chunk with content: ' over'
2025-08-28T00:35:01.049207Z DEBUG inference_engine::server: Streaming token: ' countless'
2025-08-28T00:35:01.049228Z DEBUG inference_engine::server: Sending chunk with content: ' countless'
2025-08-28T00:35:01.178030Z DEBUG inference_engine::server: Streaming token: ' generations'
2025-08-28T00:35:01.178058Z DEBUG inference_engine::server: Sending chunk with content: ' generations'
2025-08-28T00:35:01.306740Z DEBUG inference_engine::server: Streaming token: '،'
2025-08-28T00:35:01.306762Z DEBUG inference_engine::server: Sending chunk with content: '،'
2025-08-28T00:35:01.434552Z DEBUG inference_engine::server: Streaming token: ' making'
2025-08-28T00:35:01.434573Z DEBUG inference_engine::server: Sending chunk with content: ' making'
2025-08-28T00:35:01.562628Z DEBUG inference_engine::server: Streaming token: ' them'
2025-08-28T00:35:01.562647Z DEBUG inference_engine::server: Sending chunk with content: ' them'
2025-08-28T00:35:01.690509Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:01.690530Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:01.819330Z DEBUG inference_engine::server: Streaming token: ' unique'
2025-08-28T00:35:01.819351Z DEBUG inference_engine::server: Sending chunk with content: ' unique'
2025-08-28T00:35:01.947700Z DEBUG inference_engine::server: Streaming token: ' members'
2025-08-28T00:35:01.947720Z DEBUG inference_engine::server: Sending chunk with content: ' members'
2025-08-28T00:35:02.076045Z DEBUG inference_engine::server: Streaming token: ' within'
2025-08-28T00:35:02.076071Z DEBUG inference_engine::server: Sending chunk with content: ' within'
2025-08-28T00:35:02.204721Z DEBUG inference_engine::server: Streaming token: ' our'
2025-08-28T00:35:02.204743Z DEBUG inference_engine::server: Sending chunk with content: ' our'
2025-08-28T00:35:02.333483Z DEBUG inference_engine::server: Streaming token: ' families'
2025-08-28T00:35:02.333506Z DEBUG inference_engine::server: Sending chunk with content: ' families'
2025-08-28T00:35:02.461905Z DEBUG inference_engine::server: Streaming token: ','
2025-08-28T00:35:02.461926Z DEBUG inference_engine::server: Sending chunk with content: ','
2025-08-28T00:35:02.589686Z DEBUG inference_engine::server: Streaming token: ' enriching'
2025-08-28T00:35:02.589710Z DEBUG inference_engine::server: Sending chunk with content: ' enriching'
2025-08-28T00:35:02.718589Z DEBUG inference_engine::server: Streaming token: ' lives'
2025-08-28T00:35:02.718618Z DEBUG inference_engine::server: Sending chunk with content: ' lives'
2025-08-28T00:35:02.846614Z DEBUG inference_engine::server: Streaming token: ' in'
2025-08-28T00:35:02.846635Z DEBUG inference_engine::server: Sending chunk with content: ' in'
2025-08-28T00:35:02.976008Z DEBUG inference_engine::server: Streaming token: ' profound'
2025-08-28T00:35:02.976028Z DEBUG inference_engine::server: Sending chunk with content: ' profound'
2025-08-28T00:35:03.107573Z DEBUG inference_engine::server: Streaming token: ' ways'
2025-08-28T00:35:03.107594Z DEBUG inference_engine::server: Sending chunk with content: ' ways'
2025-08-28T00:35:03.236069Z DEBUG inference_engine::server: Streaming token: ' regardless'
2025-08-28T00:35:03.236088Z DEBUG inference_engine::server: Sending chunk with content: ' regardless'
2025-08-28T00:35:03.364469Z DEBUG inference_engine::server: Streaming token: ' if'
2025-08-28T00:35:03.364492Z DEBUG inference_engine::server: Sending chunk with content: ' if'
2025-08-28T00:35:03.492669Z DEBUG inference_engine::server: Streaming token: ' we'
2025-08-28T00:35:03.492690Z DEBUG inference_engine::server: Sending chunk with content: ' we'
2025-08-28T00:35:03.621905Z DEBUG inference_engine::server: Streaming token: ' choose'
2025-08-28T00:35:03.621927Z DEBUG inference_engine::server: Sending chunk with content: ' choose'
2025-08-28T00:35:03.754038Z DEBUG inference_engine::server: Streaming token: ' to'
2025-08-28T00:35:03.754059Z DEBUG inference_engine::server: Sending chunk with content: ' to'
2025-08-28T00:35:03.883044Z DEBUG inference_engine::server: Streaming token: ' own'
2025-08-28T00:35:03.883066Z DEBUG inference_engine::server: Sending chunk with content: ' own'
2025-08-28T00:35:04.010685Z DEBUG inference_engine::server: Streaming token: ' one'
2025-08-28T00:35:04.010703Z DEBUG inference_engine::server: Sending chunk with content: ' one'
2025-08-28T00:35:04.139584Z DEBUG inference_engine::server: Streaming token: ' ourselves'
2025-08-28T00:35:04.139609Z DEBUG inference_engine::server: Sending chunk with content: ' ourselves'
2025-08-28T00:35:04.269128Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.269144Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.398132Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.398151Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.527627Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.527654Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.657885Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.657914Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.788586Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.788607Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.918153Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.918179Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.048431Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:05.048460Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.178022Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:05.178055Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.308805Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:05.308833Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.438091Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:05.438113Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.561745Z  INFO inference_engine::text_generation: Streaming Text generation completed in 13.22s
2025-08-28T00:35:05.561767Z  INFO inference_engine::text_generation: Streaming Tokens generated: 100
2025-08-28T00:35:05.561770Z  INFO inference_engine::text_generation: Streaming Generation speed: 7.56 tokens/second
2025-08-28T00:35:05.561772Z  INFO inference_engine::text_generation: Streaming Average time per token: 129.65ms
2025-08-28T00:35:05.561774Z DEBUG inference_engine::text_generation: Streaming - Forward pass: 124.98ms (96.4%)
2025-08-28T00:35:05.561776Z DEBUG inference_engine::text_generation: Streaming - Repeat penalty: 74.02µs (0.1%)
2025-08-28T00:35:05.561778Z DEBUG inference_engine::text_generation: Streaming - Sampling: 5.85ms (4.5%)
2025-08-28T00:35:05.561779Z  INFO inference_engine::text_generation: Streaming Total request time: 13.25s
2025-08-28T00:35:05.561781Z DEBUG inference_engine::text_generation: Streaming - Tokenization: 217.04µs (0.0%)
2025-08-28T00:35:05.561782Z DEBUG inference_engine::text_generation: Streaming - Generation: 13.22s (99.8%)
2025-08-28T00:35:05.561783Z DEBUG inference_engine::text_generation: Streaming - Final decoding: 8.17µs (0.0%)
2025-08-28T00:35:30.845607Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2025-08-28T00:35:30.845670Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: inference_engine::server: Formatted prompt: <start_of_turn>user
You are a helpful assistant who responds thoughtfully and concisely.
Write a paragraph about cats<end_of_turn>
<start_of_turn>model
2025-08-28T00:35:30.845684Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: predict_otron_9000::middleware::metrics: POST /v1/chat/completions 200 OK - 0 ms
2025-08-28T00:35:30.845691Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=0 ms status=200
2025-08-28T00:35:30.845719Z DEBUG inference_engine::text_generation: Cleared penalty cache for new generation (streaming mode)
2025-08-28T00:35:30.845789Z DEBUG inference_engine::text_generation: Streaming Tokenization completed in 65.50µs
2025-08-28T00:35:30.845794Z DEBUG inference_engine::text_generation: Streaming Input tokens: 26
2025-08-28T00:35:30.871195Z DEBUG inference_engine::text_generation: Using special generation approach for gemma-2/gemma-3 models (streaming)
./run_server.sh: line 7: 30566 Killed: 9 cargo run --bin predict-otron-9000 --release

View File

@@ -1,39 +0,0 @@
Compiling inference-engine v0.1.0 (/Users/williamseemueller/workspace/seemueller-io/predict-otron-9000/crates/inference-engine)
warning: unused import: `Config as Config1`
--> crates/inference-engine/src/model.rs:2:42
|
2 | use candle_transformers::models::gemma::{Config as Config1, Model as Model1};
| ^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused import: `Config as Config2`
--> crates/inference-engine/src/model.rs:3:43
|
3 | use candle_transformers::models::gemma2::{Config as Config2, Model as Model2};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `Config as Config3`
--> crates/inference-engine/src/model.rs:4:43
|
4 | use candle_transformers::models::gemma3::{Config as Config3, Model as Model3};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `self`
--> crates/inference-engine/src/server.rs:10:28
|
10 | use futures_util::stream::{self, Stream};
| ^^^^
warning: `inference-engine` (lib) generated 4 warnings (run `cargo fix --lib -p inference-engine` to apply 4 suggestions)
Compiling predict-otron-9000 v0.1.0 (/Users/williamseemueller/workspace/seemueller-io/predict-otron-9000/crates/predict-otron-9000)
Finished `release` profile [optimized] target(s) in 4.24s
Running `target/release/predict-otron-9000`
avx: false, neon: true, simd128: false, f16c: false
2025-08-28T00:28:26.075133Z  INFO hf_hub: Using token file found "/Users/williamseemueller/.cache/huggingface/token"
retrieved the files in 557.625µs
2025-08-28T00:28:26.075815Z  INFO predict_otron_9000::middleware::metrics: Performance metrics summary:
thread 'main' panicked at crates/predict-otron-9000/src/main.rs:91:61:
called `Result::unwrap()` on an `Err` value: Os { code: 48, kind: AddrInUse, message: "Address already in use" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace