
Performance Testing and Optimization Guide
This guide provides instructions for measuring, analyzing, and optimizing the performance of predict-otron-9000 components.
Overview
The predict-otron-9000 system consists of three main components:
- predict-otron-9000: The main server that integrates the other components
- embeddings-engine: Generates text embeddings using the Nomic Embed Text v1.5 model
- inference-engine: Handles text generation using various Gemma models
We've implemented performance metrics collection in all three components to identify bottlenecks and measure optimization impact.
Getting Started
Prerequisites
- Rust 1.70+ with 2024 edition support
- Cargo package manager
- Basic understanding of the system architecture
- The project built with cargo build --release
Running Performance Tests
We've created two scripts for performance testing:
- performance_test_embeddings.sh: Tests embedding generation with different input sizes
- performance_test_inference.sh: Tests text generation with different prompt sizes
Step 1: Start the Server
# Start the server in a terminal window
./run_server.sh
Wait for the server to fully initialize (look for "server listening" message).
Step 2: Run Embedding Performance Tests
In a new terminal window:
# Run the embeddings performance test
./performance_test_embeddings.sh
This will test embedding generation with small, medium, and large inputs and report timing metrics.
Step 3: Run Inference Performance Tests
# Run the inference performance test
./performance_test_inference.sh
This will test text generation with small, medium, and large prompts and report timing metrics.
Step 4: Collect and Analyze Results
The test scripts store detailed results in temporary directories. Review these results along with the server logs to identify performance bottlenecks.
# Check server logs for detailed timing breakdowns
# Analyze the performance metrics summaries
Performance Metrics Collected
API Request Metrics (predict-otron-9000)
- Total request count
- Average response time
- Minimum response time
- Maximum response time
- Per-endpoint metrics
These metrics are logged every 60 seconds to the server console.
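As a rough illustration of how these per-endpoint numbers can be accumulated, here is a minimal, self-contained Rust sketch. The struct and field names are hypothetical and not taken from the predict-otron-9000 source; the snippet only shows the count/average/min/max bookkeeping described above.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical per-endpoint accumulator; names are illustrative only.
#[derive(Default)]
struct EndpointStats {
    count: u64,
    total: Duration,
    min: Option<Duration>,
    max: Duration,
}

#[derive(Default)]
struct RequestMetrics {
    per_endpoint: HashMap<String, EndpointStats>,
}

impl RequestMetrics {
    /// Record one completed request for an endpoint.
    fn record(&mut self, endpoint: &str, elapsed: Duration) {
        let stats = self.per_endpoint.entry(endpoint.to_string()).or_default();
        stats.count += 1;
        stats.total += elapsed;
        stats.min = Some(stats.min.map_or(elapsed, |m| m.min(elapsed)));
        stats.max = stats.max.max(elapsed);
    }

    /// Print a summary; in the real server this could run on a 60-second interval.
    fn log_summary(&self) {
        for (endpoint, s) in &self.per_endpoint {
            let avg = s.total / s.count.max(1) as u32;
            println!(
                "{endpoint}: count={} avg={:?} min={:?} max={:?}",
                s.count, avg, s.min.unwrap_or_default(), s.max
            );
        }
    }
}

fn main() {
    let mut metrics = RequestMetrics::default();
    metrics.record("/v1/chat/completions", Duration::from_millis(850));
    metrics.record("/v1/embeddings", Duration::from_millis(120));
    metrics.log_summary();
}
```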
Embedding Generation Metrics (embeddings-engine)
- Model initialization time
- Input processing time
- Embedding generation time
- Post-processing time
- Total request time
- Memory usage estimates
Text Generation Metrics (inference-engine)
- Tokenization time
- Forward pass time (per token)
- Repeat penalty computation time
- Token sampling time
- Average time per token
- Total generation time
- Tokens per second rate
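The per-token numbers reduce to simple arithmetic over a generation run: average time per token is total generation time divided by the number of tokens produced, and tokens per second is its inverse. A small illustrative sketch (names are hypothetical, not from the inference-engine source):

```rust
use std::time::Instant;

/// Illustrative helper for deriving per-token metrics from a generation run.
struct GenerationTiming {
    started: Instant,
    tokens_generated: usize,
}

impl GenerationTiming {
    fn start() -> Self {
        Self { started: Instant::now(), tokens_generated: 0 }
    }

    fn token_emitted(&mut self) {
        self.tokens_generated += 1;
    }

    fn report(&self) {
        let total = self.started.elapsed();
        if self.tokens_generated == 0 {
            return;
        }
        let avg_per_token = total / self.tokens_generated as u32;
        let tokens_per_second = self.tokens_generated as f64 / total.as_secs_f64();
        println!(
            "total={:?} avg/token={:?} tokens/s={:.2}",
            total, avg_per_token, tokens_per_second
        );
    }
}

fn main() {
    let mut timing = GenerationTiming::start();
    for _ in 0..32 {
        timing.token_emitted(); // stand-in for one decode step
    }
    timing.report();
}
```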
Potential Optimization Areas
Based on code analysis, here are potential areas for optimization:
Embeddings Engine
- Model Initialization: The model is initialized for each request. Consider:
  - Creating a persistent model instance (singleton pattern; see the sketch after this list)
  - Implementing a model cache
  - Using a smaller model for less demanding tasks
- Padding Logic: The code pads embeddings to 768 dimensions, which may be unnecessary:
  - Make padding configurable
  - Use the native dimension size when possible
- Random Embedding Generation: When embeddings are all zeros, random embeddings are generated:
  - Profile this logic to assess its performance impact
  - Consider pre-computing fallback embeddings
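As a rough sketch of the singleton approach mentioned above, the snippet below uses once_cell to initialize a model exactly once and reuse it across requests. The EmbeddingModel type and its methods are placeholders, not the actual embeddings-engine API:

```rust
use once_cell::sync::Lazy;
use std::sync::Mutex;

/// Placeholder for the real embedding model type; loading the actual
/// Nomic Embed Text v1.5 weights is elided here.
struct EmbeddingModel;

impl EmbeddingModel {
    fn load() -> Self {
        // Expensive one-time initialization (weights, tokenizer, device setup).
        EmbeddingModel
    }

    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; 768]
    }
}

// Initialized once on first use and reused by every subsequent request.
static MODEL: Lazy<Mutex<EmbeddingModel>> =
    Lazy::new(|| Mutex::new(EmbeddingModel::load()));

fn handle_request(text: &str) -> Vec<f32> {
    MODEL.lock().expect("model mutex poisoned").embed(text)
}

fn main() {
    let embedding = handle_request("hello world");
    println!("dimensions: {}", embedding.len());
}
```

A Mutex is used here for simplicity; a model that only needs shared read access could instead be placed behind an Arc without locking.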
Inference Engine
- Context Window Management: The code uses different approaches for different model versions:
  - Profile both approaches to determine which is more efficient
  - Optimize context window size based on performance data
- Repeat Penalty Computation: This computation is done for each token (see the sketch after this list):
  - Consider optimizing the algorithm or data structure
  - Analyze whether penalty strength can be reduced for better performance
- Tensor Operations: The code creates new tensors frequently:
  - Consider tensor reuse where possible
  - Investigate more efficient tensor operations
- Token Streaming: Improve the efficiency of token output streaming:
  - Batch token decoding where possible
  - Reduce memory allocations during streaming
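To make the repeat-penalty point concrete, here is a minimal sketch that keeps a cached set of recently generated token ids and rescales only those logits each step, instead of rescanning the full history. It is illustrative only; the inference-engine's actual data structures and caching strategy may differ:

```rust
use std::collections::HashSet;

/// Minimal sketch of repeat-penalty application with a cached set of
/// recent token ids.
struct RepeatPenalty {
    penalty: f32,
    recent: HashSet<u32>,
}

impl RepeatPenalty {
    fn new(penalty: f32) -> Self {
        Self { penalty, recent: HashSet::new() }
    }

    /// Record a token that was just generated.
    fn push(&mut self, token_id: u32) {
        self.recent.insert(token_id);
    }

    /// Scale the logits of previously seen tokens in place.
    fn apply(&self, logits: &mut [f32]) {
        for &id in &self.recent {
            if let Some(logit) = logits.get_mut(id as usize) {
                *logit = if *logit > 0.0 {
                    *logit / self.penalty
                } else {
                    *logit * self.penalty
                };
            }
        }
    }
}

fn main() {
    let mut rp = RepeatPenalty::new(1.1);
    rp.push(42);
    let mut logits = vec![0.5_f32; 64];
    rp.apply(&mut logits);
    println!("penalized logit 42: {}", logits[42]);
}
```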
Optimization Cycle
Follow this cycle for each optimization:
- Measure: Run performance tests to establish baseline
- Identify: Find the biggest bottleneck based on metrics
- Optimize: Make targeted changes to address the bottleneck
- Test: Run performance tests again to measure improvement
- Repeat: Identify the next bottleneck and continue
Tips for Effective Optimization
- Make One Change at a Time: Isolate changes to accurately measure their impact
- Focus on Hot Paths: Optimize code that runs frequently or takes significant time
- Use Profiling Tools: Consider using Rust profiling tools like perf or flamegraph
- Consider Trade-offs: Some optimizations may increase memory usage or reduce accuracy
- Document Changes: Keep track of optimizations and their measured impact
Memory Optimization
Beyond speed, consider memory usage optimization:
- Monitor Memory Usage: Use tools like top or htop to monitor process memory
- Reduce Allocations: Minimize temporary allocations in hot loops
- Buffer Reuse: Reuse buffers instead of creating new ones (see the sketch after this list)
- Lazy Loading: Load resources only when needed
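A small example of the buffer-reuse tip: allocate a scratch Vec once and clear it between iterations so its capacity is retained across a hot loop. This is a generic sketch, not code from the project:

```rust
/// Illustrative buffer reuse in a hot loop: one Vec is allocated up front and
/// cleared between iterations instead of being reallocated per batch.
fn process_batches(batches: &[Vec<u32>]) -> usize {
    let mut scratch: Vec<f32> = Vec::with_capacity(4096);
    let mut processed = 0;

    for batch in batches {
        scratch.clear(); // keeps the existing capacity, no new allocation
        scratch.extend(batch.iter().map(|&id| id as f32 * 0.5));
        processed += scratch.len();
    }
    processed
}

fn main() {
    let batches = vec![vec![1_u32, 2, 3], vec![4, 5]];
    println!("processed {} values", process_batches(&batches));
}
```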
Implemented Optimizations
Several optimizations have already been implemented based on this guide:
- Embeddings Engine: Persistent model instance (singleton pattern) using once_cell
- Inference Engine: Optimized repeat penalty computation with caching
For details on these optimizations, their implementation, and impact, see the OPTIMIZATIONS.md document.
Next Steps
After the initial optimizations, consider these additional system-level improvements:
- Concurrency: Process multiple requests in parallel where appropriate
- Caching: Implement caching for common inputs/responses
- Load Balancing: Distribute work across multiple instances
- Hardware Acceleration: Utilize GPU or specialized hardware if available
Refer to OPTIMIZATIONS.md for a prioritized roadmap of future optimizations.