# Performance Testing and Optimization Guide

This guide provides instructions for measuring, analyzing, and optimizing the performance of predict-otron-9000 components.

## Overview

The predict-otron-9000 system consists of three main components:

1. **predict-otron-9000**: The main server that integrates the other components
2. **embeddings-engine**: Generates text embeddings using the Nomic Embed Text v1.5 model
3. **inference-engine**: Handles text generation using various Gemma models

We've implemented performance metrics collection in all three components to identify bottlenecks and measure optimization impact.

## Getting Started

### Prerequisites

- Rust 1.70+ with 2024 edition support
- Cargo package manager
- Basic understanding of the system architecture
- The project built with `cargo build --release`

### Running Performance Tests

We've created two scripts for performance testing:

1. **performance_test_embeddings.sh**: Tests embedding generation with different input sizes
2. **performance_test_inference.sh**: Tests text generation with different prompt sizes

#### Step 1: Start the Server

```bash
# Start the server in a terminal window
./run_server.sh
```

Wait for the server to fully initialize (look for the "server listening" message).

#### Step 2: Run Embedding Performance Tests

In a new terminal window:

```bash
# Run the embeddings performance test
./performance_test_embeddings.sh
```

This will test embedding generation with small, medium, and large inputs and report timing metrics.

#### Step 3: Run Inference Performance Tests

```bash
# Run the inference performance test
./performance_test_inference.sh
```

This will test text generation with small, medium, and large prompts and report timing metrics.

#### Step 4: Collect and Analyze Results

The test scripts store detailed results in temporary directories. Review these results along with the server logs to identify performance bottlenecks.

```bash
# Check server logs for detailed timing breakdowns
# Analyze the performance metrics summaries
```

## Performance Metrics Collected

### API Request Metrics (predict-otron-9000)

- Total request count
- Average response time
- Minimum response time
- Maximum response time
- Per-endpoint metrics

These metrics are logged to the server console every 60 seconds.

### Embedding Generation Metrics (embeddings-engine)

- Model initialization time
- Input processing time
- Embedding generation time
- Post-processing time
- Total request time
- Memory usage estimates

### Text Generation Metrics (inference-engine)

- Tokenization time
- Forward pass time (per token)
- Repeat penalty computation time
- Token sampling time
- Average time per token
- Total generation time
- Tokens per second rate

## Potential Optimization Areas

Based on code analysis, here are potential areas for optimization:

### Embeddings Engine

1. **Model Initialization**: The model is initialized for each request. Consider:
   - Creating a persistent model instance (singleton pattern); see the sketch after this list
   - Implementing a model cache
   - Using a smaller model for less demanding tasks
2. **Padding Logic**: The code pads embeddings to 768 dimensions, which may be unnecessary:
   - Make padding configurable
   - Use the native dimension size when possible
3. **Random Embedding Generation**: When embeddings are all zeros, random embeddings are generated:
   - Profile this logic to assess its performance impact
   - Consider pre-computing fallback embeddings
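For reference, the singleton approach from item 1 could look roughly like the sketch below. This is a minimal illustration assuming `once_cell` is available as a dependency; `EmbeddingModel`, `load()`, and `embed_request()` are placeholder names, not the actual embeddings-engine types.

```rust
// Minimal singleton sketch: the model is loaded lazily on first use and then
// reused by every request, so per-request initialization cost disappears.
// All type and function names here are illustrative placeholders.
use once_cell::sync::Lazy;
use std::sync::Mutex;

struct EmbeddingModel;

impl EmbeddingModel {
    fn load() -> Self {
        // Expensive initialization (weights, tokenizer) happens exactly once here.
        EmbeddingModel
    }

    fn embed(&self, _input: &str) -> Vec<f32> {
        // Placeholder for the real forward pass.
        vec![0.0; 768]
    }
}

// Shared, lazily initialized model instance.
static MODEL: Lazy<Mutex<EmbeddingModel>> = Lazy::new(|| Mutex::new(EmbeddingModel::load()));

fn embed_request(input: &str) -> Vec<f32> {
    MODEL.lock().expect("model mutex poisoned").embed(input)
}

fn main() {
    // The first call pays the initialization cost; later calls reuse the same instance.
    let _ = embed_request("warm-up");
    let _ = embed_request("subsequent requests skip initialization");
}
```

A `Mutex` is the simplest way to share the instance; if the model's API is read-only, an `RwLock` or a lock-free handle may reduce contention under concurrent load.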
### Inference Engine

1. **Context Window Management**: The code uses different approaches for different model versions:
   - Profile both approaches to determine the more efficient one
   - Optimize context window size based on performance data
2. **Repeat Penalty Computation**: This computation is done for each token:
   - Consider optimizing the algorithm or data structure
   - Analyze whether penalty strength can be reduced for better performance
3. **Tensor Operations**: The code creates new tensors frequently:
   - Consider tensor reuse where possible
   - Investigate more efficient tensor operations
4. **Token Streaming**: Improve the efficiency of token output streaming:
   - Batch token decoding where possible
   - Reduce memory allocations during streaming

## Optimization Cycle

Follow this cycle for each optimization:

1. **Measure**: Run performance tests to establish a baseline
2. **Identify**: Find the biggest bottleneck based on metrics
3. **Optimize**: Make targeted changes to address the bottleneck
4. **Test**: Run performance tests again to measure improvement
5. **Repeat**: Identify the next bottleneck and continue

## Tips for Effective Optimization

1. **Make One Change at a Time**: Isolate changes to accurately measure their impact
2. **Focus on Hot Paths**: Optimize code that runs frequently or takes significant time
3. **Use Profiling Tools**: Consider using Rust profiling tools like `perf` or `flamegraph`
4. **Consider Trade-offs**: Some optimizations may increase memory usage or reduce accuracy
5. **Document Changes**: Keep track of optimizations and their measured impact

## Memory Optimization

Beyond speed, consider memory usage optimization:

1. **Monitor Memory Usage**: Use tools like `top` or `htop` to monitor process memory
2. **Reduce Allocations**: Minimize temporary allocations in hot loops
3. **Buffer Reuse**: Reuse buffers instead of creating new ones
4. **Lazy Loading**: Load resources only when needed

## Implemented Optimizations

Several optimizations have already been implemented based on this guide:

1. **Embeddings Engine**: Persistent model instance (singleton pattern) using once_cell
2. **Inference Engine**: Optimized repeat penalty computation with caching

For details on these optimizations, their implementation, and their impact, see the [OPTIMIZATIONS.md](OPTIMIZATIONS.md) document.

## Next Steps

After the initial optimizations, consider these additional system-level improvements:

1. **Concurrency**: Process multiple requests in parallel where appropriate
2. **Caching**: Implement caching for common inputs/responses (a minimal sketch follows at the end of this guide)
3. **Load Balancing**: Distribute work across multiple instances
4. **Hardware Acceleration**: Utilize GPU or specialized hardware if available

Refer to [OPTIMIZATIONS.md](OPTIMIZATIONS.md) for a prioritized roadmap of future optimizations.
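As a starting point for the caching idea in item 2 above, an exact-match in-memory cache keyed by input text could look like the following. This is a hedged sketch that assumes identical inputs recur often enough to be worth caching; `EmbeddingCache` and its methods are illustrative names, not part of the current codebase.

```rust
// Minimal exact-match cache sketch: identical inputs are computed once and
// served from memory afterwards. Names are illustrative placeholders.
use std::collections::HashMap;
use std::sync::Mutex;

struct EmbeddingCache {
    entries: Mutex<HashMap<String, Vec<f32>>>,
}

impl EmbeddingCache {
    fn new() -> Self {
        EmbeddingCache { entries: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached embedding for `input`, computing and storing it on a miss.
    fn get_or_compute<F>(&self, input: &str, compute: F) -> Vec<f32>
    where
        F: FnOnce(&str) -> Vec<f32>,
    {
        let mut entries = self.entries.lock().expect("cache mutex poisoned");
        if let Some(hit) = entries.get(input) {
            return hit.clone();
        }
        let value = compute(input);
        entries.insert(input.to_string(), value.clone());
        value
    }
}

fn main() {
    let cache = EmbeddingCache::new();
    // The second call with the same input is served from the cache.
    let _ = cache.get_or_compute("hello", |_| vec![0.0; 768]);
    let _ = cache.get_or_compute("hello", |_| vec![0.0; 768]);
}
```

Note that an unbounded map grows without limit, so a production version would need an eviction policy (for example LRU or a size cap), in line with the memory optimization guidance above.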