# Performance Testing and Optimization Guide

This guide provides instructions for measuring, analyzing, and optimizing the performance of predict-otron-9000 components.

## Overview

The predict-otron-9000 system consists of three main components:

1. **predict-otron-9000**: The main server that integrates the other components
2. **embeddings-engine**: Generates text embeddings using the Nomic Embed Text v1.5 model
3. **inference-engine**: Handles text generation using various Gemma models

We've implemented performance metrics collection in all three components to identify bottlenecks and measure optimization impact.
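
Concretely, the instrumentation boils down to wall-clock timing around each phase of work. A minimal sketch of the pattern (the phase name and output format here are illustrative, not the project's actual logging code):

```rust
use std::time::Instant;

fn expensive_phase() -> u64 {
    // Stand-in for real work (tokenization, a forward pass, etc.).
    (0..1_000_000u64).sum()
}

fn main() {
    let start = Instant::now();
    let _result = expensive_phase();
    // Each component records per-phase timings in this style.
    println!("expensive_phase took {:?}", start.elapsed());
}
```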

## Getting Started

### Prerequisites

- Rust 1.70+ with 2024 edition support
- Cargo package manager
- Basic understanding of the system architecture
- The project built with `cargo build --release`

### Running Performance Tests

We've created two scripts for performance testing:

1. **performance_test_embeddings.sh**: Tests embedding generation with different input sizes
2. **performance_test_inference.sh**: Tests text generation with different prompt sizes

#### Step 1: Start the Server

```bash
# Start the server in a terminal window
./run_server.sh
```

Wait for the server to fully initialize (look for the "server listening" message).

#### Step 2: Run Embedding Performance Tests

In a new terminal window:

```bash
# Run the embeddings performance test
./performance_test_embeddings.sh
```

This will test embedding generation with small, medium, and large inputs and report timing metrics.

#### Step 3: Run Inference Performance Tests

```bash
# Run the inference performance test
./performance_test_inference.sh
```

This will test text generation with small, medium, and large prompts and report timing metrics.

#### Step 4: Collect and Analyze Results

The test scripts store detailed results in temporary directories. Review these results along with the server logs to identify performance bottlenecks.

```bash
# Check server logs for detailed timing breakdowns
# (adjust the path to wherever run_server.sh writes its output)
grep -i "time" server.log

# Analyze the periodic performance metrics summaries
grep -i "metrics" server.log
```

## Performance Metrics Collected

### API Request Metrics (predict-otron-9000)

- Total request count
- Average response time
- Minimum response time
- Maximum response time
- Per-endpoint metrics

These metrics are logged every 60 seconds to the server console.
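
A minimal, std-only sketch of how these counters can be accumulated (type and field names are illustrative, not the server's actual implementation):

```rust
use std::time::{Duration, Instant};

/// Illustrative accumulator for per-endpoint response times.
#[derive(Default)]
struct EndpointMetrics {
    count: u64,
    total: Duration,
    min: Option<Duration>,
    max: Option<Duration>,
}

impl EndpointMetrics {
    fn record(&mut self, elapsed: Duration) {
        self.count += 1;
        self.total += elapsed;
        self.min = Some(self.min.map_or(elapsed, |m| m.min(elapsed)));
        self.max = Some(self.max.map_or(elapsed, |m| m.max(elapsed)));
    }

    fn average(&self) -> Option<Duration> {
        // `count` fits in u32 for any realistic test run.
        (self.count > 0).then(|| self.total / self.count as u32)
    }
}

fn main() {
    let mut metrics = EndpointMetrics::default();
    for _ in 0..3 {
        let start = Instant::now();
        std::thread::sleep(Duration::from_millis(5)); // stand-in for handling a request
        metrics.record(start.elapsed());
    }
    println!(
        "count={} avg={:?} min={:?} max={:?}",
        metrics.count, metrics.average(), metrics.min, metrics.max
    );
}
```

Inside the server this state would sit behind shared synchronization (for example a `Mutex`) so concurrent handlers can record into it, with a periodic task printing the summary.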

### Embedding Generation Metrics (embeddings-engine)

- Model initialization time
- Input processing time
- Embedding generation time
- Post-processing time
- Total request time
- Memory usage estimates

### Text Generation Metrics (inference-engine)

- Tokenization time
- Forward pass time (per token)
- Repeat penalty computation time
- Token sampling time
- Average time per token
- Total generation time
- Tokens per second rate
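
The tokens-per-second rate at the end of that list follows directly from the other totals; for example:

```rust
use std::time::Duration;

/// Tokens-per-second from the collected totals,
/// e.g. 50 tokens generated in 2.5 s gives 20.0.
fn tokens_per_second(tokens_generated: usize, total_generation_time: Duration) -> f64 {
    tokens_generated as f64 / total_generation_time.as_secs_f64()
}

fn main() {
    println!("{}", tokens_per_second(50, Duration::from_millis(2500))); // prints 20
}
```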

## Potential Optimization Areas

Based on code analysis, here are potential areas for optimization:

### Embeddings Engine

1. **Model Initialization**: The model is initialized for each request (see the singleton sketch after this list). Consider:
   - Creating a persistent model instance (singleton pattern)
   - Implementing a model cache
   - Using a smaller model for less demanding tasks

2. **Padding Logic**: The code pads embeddings to 768 dimensions, which may be unnecessary:
   - Make padding configurable
   - Use the native dimension size when possible

3. **Random Embedding Generation**: When embeddings are all zeros, random embeddings are generated:
   - Profile this logic to assess performance impact
   - Consider pre-computing fallback embeddings
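
As a sketch of the singleton idea from item 1, using the `once_cell` crate (the `EmbeddingModel` type and its methods are hypothetical stand-ins for the real model types):

```rust
use once_cell::sync::Lazy;

// Hypothetical stand-in for the real embedding model type.
struct EmbeddingModel;

impl EmbeddingModel {
    fn load() -> Self {
        // Expensive one-time initialization: weights, tokenizer, etc.
        EmbeddingModel
    }

    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; 768]
    }
}

// Loaded once, on first use, then shared by every request.
static MODEL: Lazy<EmbeddingModel> = Lazy::new(EmbeddingModel::load);

fn handle_request(text: &str) -> Vec<f32> {
    MODEL.embed(text)
}

fn main() {
    println!("embedding dimension: {}", handle_request("hello").len());
}
```

The Implemented Optimizations section below reports this pattern as already in place for the embeddings engine.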

### Inference Engine

1. **Context Window Management**: The code uses different approaches for different model versions:
   - Profile both approaches to determine the more efficient one
   - Optimize context window size based on performance data

2. **Repeat Penalty Computation**: This computation is done for each token (see the sketch after this list):
   - Consider optimizing the algorithm or data structure
   - Analyze whether penalty strength can be reduced for better performance

3. **Tensor Operations**: The code creates new tensors frequently:
   - Consider tensor reuse where possible
   - Investigate more efficient tensor operations

4. **Token Streaming**: Improve the efficiency of token output streaming:
   - Batch token decoding where possible
   - Reduce memory allocations during streaming
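
For item 2, the core of the computation looks roughly like the sketch below. This is the widely used CTRL-style penalty, not necessarily the engine's exact code; deduplicating the context tokens is one simple way to avoid redundant updates:

```rust
use std::collections::HashSet;

/// Apply a CTRL-style repeat penalty to raw next-token logits.
/// `recent_tokens` is the context window the penalty considers.
fn apply_repeat_penalty(logits: &mut [f32], recent_tokens: &[u32], penalty: f32) {
    let mut seen = HashSet::new();
    for &tok in recent_tokens {
        // Penalize each distinct repeated token once per step.
        if seen.insert(tok) {
            if let Some(logit) = logits.get_mut(tok as usize) {
                // Shrink positive logits, amplify negative ones.
                *logit = if *logit >= 0.0 { *logit / penalty } else { *logit * penalty };
            }
        }
    }
}

fn main() {
    let mut logits = vec![2.0, -1.0, 0.5];
    // Token 0 appears twice but is penalized only once; token 2 once.
    apply_repeat_penalty(&mut logits, &[0, 0, 2], 1.5);
    println!("{logits:?}");
}
```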

## Optimization Cycle

Follow this cycle for each optimization:

1. **Measure**: Run performance tests to establish a baseline
2. **Identify**: Find the biggest bottleneck based on the metrics
3. **Optimize**: Make targeted changes to address the bottleneck
4. **Test**: Run the performance tests again to measure the improvement
5. **Repeat**: Identify the next bottleneck and continue

## Tips for Effective Optimization

1. **Make One Change at a Time**: Isolate changes to accurately measure their impact
2. **Focus on Hot Paths**: Optimize code that runs frequently or takes significant time
3. **Use Profiling Tools**: Consider using Rust profiling tools like `perf` or `flamegraph`
4. **Consider Trade-offs**: Some optimizations may increase memory usage or reduce accuracy
5. **Document Changes**: Keep track of optimizations and their measured impact

## Memory Optimization

Beyond speed, consider memory usage optimization:

1. **Monitor Memory Usage**: Use tools like `top` or `htop` to monitor process memory
2. **Reduce Allocations**: Minimize temporary allocations in hot loops
3. **Buffer Reuse**: Reuse buffers instead of creating new ones (see the sketch after this list)
4. **Lazy Loading**: Load resources only when needed
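
As an illustration of item 3, a std-only sketch of one buffer reused across a hot loop (the project's real hot paths are in token decoding and streaming):

```rust
use std::io::{self, BufRead};

// Reuse one String buffer across iterations instead of allocating per line.
fn count_lines(reader: &mut impl BufRead) -> io::Result<usize> {
    let mut buf = String::new();
    let mut n = 0;
    loop {
        buf.clear(); // keeps the allocated capacity, drops the contents
        if reader.read_line(&mut buf)? == 0 {
            return Ok(n); // EOF
        }
        n += 1;
    }
}

fn main() -> io::Result<()> {
    let data = b"alpha\nbeta\ngamma\n";
    println!("{} lines", count_lines(&mut io::BufReader::new(&data[..]))?);
    Ok(())
}
```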

## Implemented Optimizations

Several optimizations have already been implemented based on this guide:

1. **Embeddings Engine**: Persistent model instance (singleton pattern) using `once_cell`
2. **Inference Engine**: Optimized repeat penalty computation with caching

For details on these optimizations, their implementation, and their impact, see the [OPTIMIZATIONS.md](OPTIMIZATIONS.md) document.

## Next Steps

After the initial optimizations, consider these additional system-level improvements:

1. **Concurrency**: Process multiple requests in parallel where appropriate
2. **Caching**: Implement caching for common inputs/responses (see the sketch after this list)
3. **Load Balancing**: Distribute work across multiple instances
4. **Hardware Acceleration**: Utilize GPU or specialized hardware if available
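
For item 2, a minimal in-memory sketch of such a cache, keyed on the exact input text (a production version would bound its size and handle eviction):

```rust
use std::collections::HashMap;

// Illustrative response cache keyed on the exact input text.
struct ResponseCache {
    entries: HashMap<String, String>,
}

impl ResponseCache {
    fn get_or_compute(&mut self, input: &str, compute: impl FnOnce(&str) -> String) -> String {
        if let Some(hit) = self.entries.get(input) {
            return hit.clone();
        }
        let value = compute(input);
        self.entries.insert(input.to_string(), value.clone());
        value
    }
}

fn main() {
    let mut cache = ResponseCache { entries: HashMap::new() };
    let first = cache.get_or_compute("hello", |s| format!("response to {s}"));
    let second = cache.get_or_compute("hello", |_| unreachable!("served from cache"));
    assert_eq!(first, second);
    println!("cache hit verified");
}
```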

Refer to [OPTIMIZATIONS.md](OPTIMIZATIONS.md) for a prioritized roadmap of future optimizations.