# Performance Testing and Optimization Guide
This guide provides instructions for measuring, analyzing, and optimizing the performance of predict-otron-9000 components.
## Overview
The predict-otron-9000 system consists of three main components:
1. **predict-otron-9000**: The main server that integrates the other components
2. **embeddings-engine**: Generates text embeddings using the Nomic Embed Text v1.5 model
3. **inference-engine**: Handles text generation using various Gemma models
We've implemented performance metrics collection in all three components to identify bottlenecks and measure optimization impact.
## Getting Started
### Prerequisites
- A Rust toolchain with 2024 edition support (Rust 1.85 or newer)
- Cargo package manager
- Basic understanding of the system architecture
- The project built in release mode (`cargo build --release`)
### Running Performance Tests
We've created two scripts for performance testing:
1. **performance_test_embeddings.sh**: Tests embedding generation with different input sizes
2. **performance_test_inference.sh**: Tests text generation with different prompt sizes
#### Step 1: Start the Server
```bash
# Start the server in a terminal window
./run_server.sh
```
Wait for the server to fully initialize (look for the "server listening" message).
#### Step 2: Run Embedding Performance Tests
In a new terminal window:
```bash
# Run the embeddings performance test
./performance_test_embeddings.sh
```
This will test embedding generation with small, medium, and large inputs and report timing metrics.
#### Step 3: Run Inference Performance Tests
```bash
# Run the inference performance test
./performance_test_inference.sh
```
This will test text generation with small, medium, and large prompts and report timing metrics.
#### Step 4: Collect and Analyze Results
The test scripts store detailed results in temporary directories. Review these results along with the server logs to identify performance bottlenecks.
```bash
# Check server logs for detailed timing breakdowns
# Analyze the performance metrics summaries
```
## Performance Metrics Collected
### API Request Metrics (predict-otron-9000)
- Total request count
- Average response time
- Minimum response time
- Maximum response time
- Per-endpoint metrics
These metrics are logged every 60 seconds to the server console.
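The metrics middleware lives in the server crate; the sketch below shows one way such per-endpoint timing can be collected with an axum-style middleware. This is illustrative only (assuming axum 0.7 and `once_cell`); the `track_metrics` function and `METRICS` map are made-up names, not the project's actual code.
```rust
use std::{collections::HashMap, sync::Mutex, time::{Duration, Instant}};

use axum::{extract::Request, middleware::Next, response::Response};
use once_cell::sync::Lazy;

// Illustrative global store: (request count, cumulative latency) per path.
static METRICS: Lazy<Mutex<HashMap<String, (u64, Duration)>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

// Times each request and records the result under its request path.
pub async fn track_metrics(req: Request, next: Next) -> Response {
    let path = req.uri().path().to_string();
    let start = Instant::now();
    let response = next.run(req).await;
    let elapsed = start.elapsed();

    let mut metrics = METRICS.lock().unwrap();
    let entry = metrics.entry(path).or_insert((0, Duration::ZERO));
    entry.0 += 1;
    entry.1 += elapsed;
    response
}
```
A middleware like this would be attached with `Router::new().layer(axum::middleware::from_fn(track_metrics))`, and a background task can periodically log and reset the map to produce the 60-second summaries described above.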
### Embedding Generation Metrics (embeddings-engine)
- Model initialization time
- Input processing time
- Embedding generation time
- Post-processing time
- Total request time
- Memory usage estimates
### Text Generation Metrics (inference-engine)
- Tokenization time
- Forward pass time (per token)
- Repeat penalty computation time
- Token sampling time
- Average time per token
- Total generation time
- Tokens per second rate
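As a rough illustration of how the per-token and tokens-per-second figures can be derived, the sketch below times a generation loop with `std::time::Instant`; the `generate_next_token` function is a placeholder for the engine's forward pass and sampling, not its real API.
```rust
use std::time::Instant;

// Placeholder for the engine's forward pass + sampling step.
fn generate_next_token(step: usize) -> u32 {
    step as u32
}

fn main() {
    let max_tokens = 64;
    let start = Instant::now();
    let mut generated = 0usize;

    for step in 0..max_tokens {
        let token_start = Instant::now();
        let _token = generate_next_token(step);
        generated += 1;
        // Per-token timing feeds the "forward pass time (per token)" metric.
        println!("token {step} took {:?}", token_start.elapsed());
    }

    let total = start.elapsed();
    let tokens_per_second = generated as f64 / total.as_secs_f64();
    println!("generated {generated} tokens in {total:?} ({tokens_per_second:.2} tok/s)");
}
```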
## Potential Optimization Areas
Based on code analysis, here are potential areas for optimization:
### Embeddings Engine
1. **Model Initialization**: The model is currently initialized for each request (see the sketch after this list). Consider:
- Creating a persistent model instance (singleton pattern)
- Implementing a model cache
- Using a smaller model for less demanding tasks
2. **Padding Logic**: The code pads embeddings to 768 dimensions, which may be unnecessary:
- Make padding configurable
- Use the native dimension size when possible
3. **Random Embedding Generation**: When embeddings are all zeros, random embeddings are generated:
- Profile this logic to assess performance impact
- Consider pre-computing fallback embeddings
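For the model-initialization point above, a persistent instance can be expressed with `once_cell::sync::Lazy`, the pattern noted later under Implemented Optimizations. The sketch below is illustrative only: `EmbeddingModel` and its `load`/`embed` methods stand in for the real model code.
```rust
use once_cell::sync::Lazy;
use std::sync::Mutex;

// Stand-in for the real embedding model and its (expensive) loading code.
struct EmbeddingModel;

impl EmbeddingModel {
    fn load() -> Self {
        EmbeddingModel
    }

    fn embed(&self, text: &str) -> Vec<f32> {
        vec![0.0; text.len().min(8)]
    }
}

// Loaded once on first use, then shared by every request.
static MODEL: Lazy<Mutex<EmbeddingModel>> =
    Lazy::new(|| Mutex::new(EmbeddingModel::load()));

fn embed_request(text: &str) -> Vec<f32> {
    MODEL.lock().unwrap().embed(text)
}
```
If inference only needs `&self`, the `Mutex` can be dropped in favor of a plain `Lazy<EmbeddingModel>` so concurrent requests do not serialize on the lock.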
### Inference Engine
1. **Context Window Management**: The code uses different approaches for different model versions:
- Profile both approaches to determine the more efficient one
- Optimize context window size based on performance data
2. **Repeat Penalty Computation**: This computation is done for each generated token (see the sketch after this list):
- Consider optimizing the algorithm or data structure
- Analyze whether the penalty strength can be reduced for better performance
3. **Tensor Operations**: The code creates new tensors frequently:
- Consider tensor reuse where possible
- Investigate more efficient tensor operations
4. **Token Streaming**: Improve the efficiency of token output streaming:
- Batch token decoding where possible
- Reduce memory allocations during streaming
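For the repeat-penalty point above, one way to avoid redundant work is to maintain the set of unique token ids in the generated context incrementally, so each decode step hashes only the newest token instead of rescanning the whole context. The sketch below is a simplified illustration, not necessarily how the engine's cached repeat-penalty optimization (see Implemented Optimizations) is written; the real code works with tensors rather than plain slices and may differ.
```rust
use std::collections::HashSet;

// Tracks the unique token ids seen so far; grows by at most one per step.
struct RepeatPenaltyCache {
    unique_tokens: HashSet<u32>,
}

impl RepeatPenaltyCache {
    fn new() -> Self {
        Self { unique_tokens: HashSet::new() }
    }

    // Record the token produced by the previous decode step.
    fn push(&mut self, token: u32) {
        self.unique_tokens.insert(token);
    }

    // Penalize the fresh logits for every token id seen so far
    // (shrink positive logits, push negative logits further down).
    fn apply(&self, logits: &mut [f32], penalty: f32) {
        for &token in &self.unique_tokens {
            let logit = &mut logits[token as usize];
            *logit = if *logit >= 0.0 { *logit / penalty } else { *logit * penalty };
        }
    }
}
```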
## Optimization Cycle
Follow this cycle for each optimization:
1. **Measure**: Run performance tests to establish baseline
2. **Identify**: Find the biggest bottleneck based on metrics
3. **Optimize**: Make targeted changes to address the bottleneck
4. **Test**: Run performance tests again to measure improvement
5. **Repeat**: Identify the next bottleneck and continue
## Tips for Effective Optimization
1. **Make One Change at a Time**: Isolate changes to accurately measure their impact
2. **Focus on Hot Paths**: Optimize code that runs frequently or takes significant time
3. **Use Profiling Tools**: Consider using Rust profiling tools like `perf` or `flamegraph`
4. **Consider Trade-offs**: Some optimizations may increase memory usage or reduce accuracy
5. **Document Changes**: Keep track of optimizations and their measured impact
## Memory Optimization
Beyond speed, consider memory usage optimization:
1. **Monitor Memory Usage**: Use tools like `top` or `htop` to monitor process memory
2. **Reduce Allocations**: Minimize temporary allocations in hot loops
3. **Buffer Reuse**: Reuse buffers instead of allocating new ones (see the sketch after this list)
4. **Lazy Loading**: Load resources only when needed
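As a small illustration of the buffer-reuse point, the sketch below decodes tokens into a single reusable `String` instead of allocating a new one per iteration; `decode_token_into` is a hypothetical stand-in for a real tokenizer call.
```rust
use std::fmt::Write;

// Hypothetical stand-in for a tokenizer decode call that appends the decoded
// text for `token` into an existing buffer.
fn decode_token_into(token: u32, out: &mut String) {
    write!(out, "<{token}>").unwrap();
}

fn main() {
    let tokens = [1u32, 2, 3, 4];
    // One buffer allocated up front and reused for every token.
    let mut piece = String::with_capacity(32);

    for &token in &tokens {
        piece.clear(); // keeps the capacity, drops the contents
        decode_token_into(token, &mut piece);
        print!("{piece}");
    }
    println!();
}
```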
## Implemented Optimizations
Several optimizations have already been implemented based on this guide:
1. **Embeddings Engine**: Persistent model instance (singleton pattern) using once_cell
2. **Inference Engine**: Optimized repeat penalty computation with caching
For details on these optimizations, their implementation, and impact, see the [OPTIMIZATIONS.md](OPTIMIZATIONS.md) document.
## Next Steps
After the initial optimizations, consider these additional system-level improvements:
1. **Concurrency**: Process multiple requests in parallel where appropriate
2. **Caching**: Implement caching for common inputs/responses (see the sketch after this list)
3. **Load Balancing**: Distribute work across multiple instances
4. **Hardware Acceleration**: Utilize GPU or specialized hardware if available
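For the caching idea above, a minimal sketch of an in-memory response cache keyed by the raw request body might look like the following; a real implementation would bound the cache size (for example with an LRU policy) and skip requests whose sampling is non-deterministic.
```rust
use std::collections::HashMap;
use std::sync::Mutex;

use once_cell::sync::Lazy;

// Unbounded cache for illustration only; key = request body, value = response.
static RESPONSE_CACHE: Lazy<Mutex<HashMap<String, String>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

fn handle_request(body: &str, compute: impl FnOnce(&str) -> String) -> String {
    if let Some(hit) = RESPONSE_CACHE.lock().unwrap().get(body) {
        return hit.clone();
    }
    // Two concurrent misses may both compute; acceptable for a sketch.
    let response = compute(body);
    RESPONSE_CACHE
        .lock()
        .unwrap()
        .insert(body.to_string(), response.clone());
    response
}
```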
Refer to [OPTIMIZATIONS.md](OPTIMIZATIONS.md) for a prioritized roadmap of future optimizations.