RAAF Performance Guide

This guide covers optimizing RAAF applications for speed, scalability, and cost efficiency. Learn about connection pooling, caching, provider routing, and resource management techniques for production AI systems.

After reading this guide, you will know:

  • How to optimize agent response times and throughput
  • Strategies for cost optimization and token management
  • Connection pooling and resource management techniques
  • Caching patterns for AI responses and computations
  • Provider routing and failover strategies
  • Memory management and garbage collection optimization
  • Profiling and monitoring performance bottlenecks

1. Performance Fundamentals

1.1. Performance Impact on Business Outcomes

AI system performance directly impacts customer satisfaction and business retention. Slow response times create immediate user experience problems that affect business outcomes.

Performance requirements for AI systems are typically more stringent than traditional web applications due to user expectations for conversational interfaces. Response times above 5-10 seconds often result in user abandonment.

The business impact of poor performance includes customer churn, reduced engagement, and competitive disadvantage in markets where response speed is a differentiator.

1.2. Why AI Performance Isn't Like Regular Web Performance

Traditional App: User clicks button → Database query → Response. Predictable. Optimizable.

AI App: User asks question → Model thinks (maybe 2 seconds, maybe 30) → Tools called (each with their own delays) → Another model thinks → Response. Chaos.

The patterns that work for web apps fail spectacularly with AI:

  • Caching: "How's the weather?" might have 1,000 variations
  • Load balancing: Models have different speeds and costs
  • Connection pooling: AI providers have different rate limits
  • Response time: Depends on question complexity, not server load

1.3. What Makes AI Performance Unique

Non-deterministic timing: Same input can take 500ms or 30 seconds depending on model "thinking" patterns.

Token-based costs: Performance isn't just about speed—it's about speed per dollar.

Quality vs. Speed trade-offs: Faster models often give worse answers. Slower models cost more.

Cascading delays: One slow tool call can make the entire conversation feel broken.

This isn't just about making things faster—it's about making them sustainably faster without breaking the bank or sacrificing quality.

1.4. The Five Pillars of AI Performance

Effective AI performance optimization centers on five core principles:

  1. Latency Optimization - Reducing time-to-first-token and total response time. Every millisecond matters when users are waiting for AI responses. Optimization should focus on perceived performance, not just raw speed.

  2. Throughput Maximization - Handling more concurrent requests. AI systems need to handle traffic spikes during peak hours without degrading user experience or exploding costs.

  3. Cost Optimization - Minimizing AI provider costs while maintaining quality. Speed without cost control is a path to bankruptcy. We optimize for value, not just velocity.

  4. Resource Efficiency - Optimal memory and CPU usage. AI applications are resource-intensive. Efficient resource management is the difference between profit and loss.

  5. Scalability - Performance under increasing load. Systems that perform well with 10 users often collapse at 1,000 users. We design for scale from day one.

1.5. Comprehensive Performance Optimization Strategy

Application Level: Connection pooling, caching, request batching
Agent Level: Model selection, prompt engineering, tool optimization
System Level: Resource allocation, scaling strategies, provider routing

Effective performance optimization requires coordination across all system layers to achieve optimal results.

1.6. Performance Metrics

Key metrics to monitor:

# Performance tracking service
require 'singleton'

class PerformanceTracker
  include Singleton

  def initialize
    @metrics = {}
    @start_times = {}
  end

  def start_timer(operation)
    @start_times[operation] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def end_timer(operation)
    return unless @start_times[operation]

    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @start_times[operation]
    record_metric("#{operation}_duration", duration)
    @start_times.delete(operation)
    duration
  end

  def record_metric(name, value, tags = {})
    @metrics[name] ||= []
    @metrics[name] << { value: value, timestamp: Time.current, tags: tags }

    # Send to monitoring system
    StatsD.histogram(name, value, tags: tags.map { |k, v| "#{k}:#{v}" })
  end

  def get_stats(metric_name, time_range = 1.hour.ago..Time.current)
    values = @metrics[metric_name]&.select { |m| time_range.cover?(m[:timestamp]) }&.map { |m| m[:value] } || []

    return {} if values.empty?

    {
      count: values.size,
      avg: values.sum / values.size.to_f,
      min: values.min,
      max: values.max,
      p95: percentile(values, 95),
      p99: percentile(values, 99)
    }
  end

  private

  def percentile(values, percent)
    sorted = values.sort
    index = (percent / 100.0 * sorted.length).ceil - 1
    sorted[index]
  end
end
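
A typical usage pattern wraps an agent call with the timer methods. A sketch, assuming agent is an existing RAAF::Agent and that the result exposes a usage hash (both are assumptions here):

# Example usage; the result.usage shape is an assumption
tracker = PerformanceTracker.instance

tracker.start_timer("agent_response")
result = RAAF::Runner.new(agent: agent).run("What's our refund policy?")
tracker.end_timer("agent_response")  # records "agent_response_duration"

tracker.record_metric("tokens_used", result.usage[:total_tokens]) if result.usage
puts tracker.get_stats("agent_response_duration")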

2. Connection Pooling and HTTP Optimization

2.1. Connection Management in AI Applications

AI applications often exhibit poor connection management patterns that significantly impact performance. A common issue occurs when applications create new HTTPS connections for each AI provider API call rather than reusing existing connections.

This pattern creates substantial overhead: each connection requires TLS handshake negotiation, which adds 200-500ms latency per request. With high-concurrency AI applications, this overhead can dominate actual processing time.

Connection pooling addresses this issue by maintaining persistent connections that can be reused across multiple requests.

2.2. Why Connection Pooling Matters More for AI

Traditional Web App:

  • HTTP connection → Database query → Response
  • Connection reuse within a single request scope

AI Application:

  • HTTP connection → AI provider → Tool calls → More AI calls → Response
  • Multiple round trips, multiple providers, multiple opportunities for connection overhead

Without connection pooling, each AI interaction pays this handshake cost repeatedly. That 200-500ms per request can exceed the duration of many actual AI responses.

2.3. Connection Pool Implementation Benefits

Connection pooling delivered substantial performance improvements in one production deployment:

  • Response time: 65% reduction in average response time
  • Server load: 40% reduction in CPU usage
  • Error rate: 83% reduction in timeout-related errors
  • Cost: 20% reduction in server resource requirements

The beautiful part? Zero code changes to our agents. They continued working exactly as before, but now they were riding on a high-performance connection highway instead of dirt roads.

2.4. HTTP Connection Pooling

# lib/raaf/performance/connection_pool.rb
require 'net/http'
require 'connection_pool'
require 'singleton'

module RAAF
  module Performance
    class ConnectionPool
      include Singleton

      def initialize
        @pools = {}
        @mutex = Mutex.new
      end

      def get_connection(provider_type, base_url)
        pool_key = "#{provider_type}_#{base_url}"

        @mutex.synchronize do
          @pools[pool_key] ||= create_pool(base_url)
        end

        @pools[pool_key].with do |http|
          yield http
        end
      end

      private

      def create_pool(base_url)
        # ::ConnectionPool is the connection_pool gem's class; the leading
        # :: prevents resolving to this wrapper class of the same name
        ::ConnectionPool.new(size: 20, timeout: 30) do
          uri = URI(base_url)
          http = Net::HTTP.new(uri.host, uri.port)
          http.use_ssl = uri.scheme == 'https'

          # Performance optimizations
          http.keep_alive_timeout = 30
          http.read_timeout = 60
          http.write_timeout = 60
          http.open_timeout = 10

          # Optional wire-level debugging
          http.set_debug_output($stdout) if ENV['HTTP_DEBUG']

          http.start
          http
        end
      end
    end
  end
end

2.5. Optimized HTTP Client

# lib/raaf/performance/http_client.rb
require 'zlib'
require 'stringio'
module RAAF
  module Performance
    class HTTPClient
      def initialize(base_url:, timeout: 30, retries: 3)
        @base_url = base_url
        @timeout = timeout
        @retries = retries
        @compression_enabled = true
      end

      def post(path, body, headers = {})
        PerformanceTracker.instance.start_timer("http_request")

        optimized_headers = optimize_headers(headers)
        compressed_body = compress_body(body)
        # Tell the server the body is deflate-compressed, or it cannot decode it
        optimized_headers['Content-Encoding'] = 'deflate' unless compressed_body.equal?(body)

        response = with_retries do
          ConnectionPool.instance.get_connection(:openai, @base_url) do |http|
            request = Net::HTTP::Post.new(path, optimized_headers)
            request.body = compressed_body
            http.request(request)
          end
        end

        decompress_response(response)
      ensure
        # Stop the timer even when the request raises
        PerformanceTracker.instance.end_timer("http_request")
      end

      private

      def optimize_headers(headers)
        base_headers = {
          'Content-Type' => 'application/json',
          'Accept-Encoding' => 'gzip, deflate',
          'Connection' => 'keep-alive',
          'User-Agent' => "RAAF/#{RAAF::VERSION} Ruby/#{RUBY_VERSION}"
        }

        base_headers.merge(headers)
      end

      def compress_body(body)
        return body unless @compression_enabled && body.is_a?(String) && body.length > 1000

        # Only compress large payloads
        Zlib::Deflate.deflate(body)
      end

      def decompress_response(response)
        return response unless response['Content-Encoding']

        case response['Content-Encoding']
        when 'gzip'
          response.body = Zlib::GzipReader.new(StringIO.new(response.body)).read
        when 'deflate'
          response.body = Zlib::Inflate.inflate(response.body)
        end

        response
      end

      def with_retries(&block)
        attempts = 0

        begin
          yield
        rescue Net::ReadTimeout, Net::OpenTimeout, Errno::ECONNRESET => e
          attempts += 1

          if attempts <= @retries
            sleep_time = 2 ** attempts  # Exponential backoff
            sleep(sleep_time)
            retry
          else
            raise RAAF::Errors::NetworkError, "HTTP request failed after #{@retries} retries: #{e.message}"
          end
        end
      end
    end
  end
end

3. Caching Strategies

3.1. Cost Impact of Redundant AI Processing

Repetitive AI queries create substantial cost accumulation without adding value. Consider a weather information request that triggers:

  • 3 API calls to OpenAI
  • 2 web search requests
  • 1 weather API call
  • 6 seconds of processing time
  • $0.23 in API costs

When similar questions are asked repeatedly, the cost multiplies rapidly. With 1,000 similar queries per day, costs reach $230 daily, scaling to $84,000 annually.

Effective caching strategies address this issue by serving cached responses for functionally equivalent queries, reducing both cost and latency.

3.2. Why AI Caching Is Different from Web Caching

Web App Caching: Same URL, same response
AI Caching: Similar questions, similar answers

The challenge with AI caching isn't technical—it's semantic. How do you know that "What's the weather?" and "How's the weather?" should return the same cached response?

3.3. The Three Types of AI Caching

1. Exact Match Caching (Easy but Limited)

# Only works if the user asks exactly the same question
user_query = "What's the weather?"
cache_key = "weather_#{user_query}"

2. Semantic Caching (Harder but Powerful)

# Works for similar questions with the same intent;
# a sketch of semantic_hash follows this example
user_query = "How's the weather?"
cache_key = "weather_#{semantic_hash(user_query)}"

3. Context-Aware Caching (Complex but Optimal)

# Considers user context, time, and conversation history
user_id = "user_123"
time_bucket = "2024-01-15-morning"
location = "san_francisco"
cache_key = "weather_#{user_id}_#{time_bucket}_#{location}"

3.4. Intelligent Caching Implementation Results

In one production deployment, intelligent caching delivered significant performance and cost benefits:

  • Cache hit rate: 73% for common queries
  • Response time: 95% reduction for cached responses
  • Cost reduction: 73% reduction in API calls
  • User satisfaction: 43% improvement in perceived speed

3.5. Response Caching

# lib/raaf/performance/response_cache.rb
require 'digest'
require 'ostruct'
require 'singleton'
module RAAF
  module Performance
    class ResponseCache
      include Singleton

      def initialize
        @cache = Rails.cache if defined?(Rails)
        @cache ||= ActiveSupport::Cache::MemoryStore.new(size: 100.megabytes)
        @ttl_default = 1.hour
        @hit_stats = { hits: 0, misses: 0 }
      end

      def get(key, ttl: @ttl_default, &fallback)
        cache_key = generate_cache_key(key)

        cached_result = @cache.read(cache_key)

        if cached_result
          @hit_stats[:hits] += 1
          PerformanceTracker.instance.record_metric('cache_hit', 1)
          return deserialize_result(cached_result)
        end

        @hit_stats[:misses] += 1
        PerformanceTracker.instance.record_metric('cache_miss', 1)

        return nil unless block_given?

        # Generate fresh result
        result = yield

        # Cache the result
        serialized = serialize_result(result)
        @cache.write(cache_key, serialized, expires_in: ttl)

        result
      end

      def invalidate(pattern)
        # Pattern-based cache invalidation
        if @cache.respond_to?(:delete_matched)
          @cache.delete_matched(pattern)
        else
          # Fallback for caches that don't support pattern deletion
          @cache.clear
        end
      end

      def hit_rate
        total = @hit_stats[:hits] + @hit_stats[:misses]
        return 0 if total == 0

        (@hit_stats[:hits].to_f / total * 100).round(2)
      end

      private

      def generate_cache_key(key)
        case key
        when Hash
          "raaf:#{Digest::SHA256.hexdigest(key.to_json)}"
        when String
          "raaf:#{Digest::SHA256.hexdigest(key)}"
        else
          "raaf:#{Digest::SHA256.hexdigest(key.to_s)}"
        end
      end

      def serialize_result(result)
        {
          messages: result.messages,
          usage: result.usage,
          context_variables: result.context_variables,
          cached_at: Time.current.iso8601
        }
      end

      def deserialize_result(cached_data)
        # Create a mock result object from cached data
        OpenStruct.new(cached_data)
      end
    end
  end
end
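
Usage follows a fetch-or-compute pattern: pass a deterministic key and a block that produces a fresh result on a miss. A sketch, assuming agent is already defined:

# Fetch-or-compute: the block only runs on a cache miss
cache = RAAF::Performance::ResponseCache.instance

result = cache.get({ agent: "FAQBot", message: "return policy" }, ttl: 30.minutes) do
  RAAF::Runner.new(agent: agent).run("What is the return policy?")
end

puts "Cache hit rate: #{cache.hit_rate}%"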

3.6. Intelligent Caching Strategy

# app/services/cached_agent_service.rb
class CachedAgentService
  def initialize(agent:, cache_strategy: :smart)
    @agent = agent
    @cache = RAAF::Performance::ResponseCache.instance
    @cache_strategy = cache_strategy
  end

  def run(message, context: {})
    return run_without_cache(message, context) if skip_cache?(message, context)

    cache_key = build_cache_key(message, context)
    ttl = calculate_ttl(message, context)

    @cache.get(cache_key, ttl: ttl) do
      run_without_cache(message, context)
    end
  end

  private

  def run_without_cache(message, context)
    runner = RAAF::Runner.new(agent: @agent)
    runner.run(message, context_variables: context)
  end

  def skip_cache?(message, context)
    case @cache_strategy
    when :never
      true
    when :always
      false
    when :smart
      smart_cache_decision(message, context)
    end
  end

  def smart_cache_decision(message, context)
    # Skip cache for personalized or time-sensitive queries
    return true if contains_personal_info?(message, context)
    return true if contains_time_references?(message)
    return true if context[:user_id] && context[:personalized]
    return true if @agent.tools.any? { |tool| tool[:function][:name].include?('current') }

    false
  end

  def build_cache_key(message, context)
    # Create deterministic cache key
    key_data = {
      agent_name: @agent.name,
      agent_instructions: @agent.instructions,
      agent_model: @agent.model,
      message: normalize_message(message),
      context: sanitize_context(context),
      tools: @agent.tools.map { |t| t[:function][:name] }.sort
    }

    key_data
  end

  def calculate_ttl(message, context)
    # Dynamic TTL based on content type
    return 5.minutes if contains_current_info_request?(message)
    return 1.hour if context[:conversation_type] == 'support'
    return 1.day if factual_query?(message)
    return 4.hours  # Default
  end

  def normalize_message(message)
    # Remove user-specific information for better cache hits
    message.gsub(/\b(my|I|me|mine)\b/i, '[USER]')
           .gsub(/\b\d{4}-\d{2}-\d{2}\b/, '[DATE]')
           .gsub(/\b\d{1,2}:\d{2}(:\d{2})?\b/, '[TIME]')
           .strip
           .downcase
  end

  def sanitize_context(context)
    # Remove user-specific context for caching
    context.except(:user_id, :session_id, :request_id, :timestamp)
  end

  def contains_personal_info?(message, context)
    personal_patterns = [
      /\b(my|I|me|mine|myself)\b/i,
      /\bemail\b/i,
      /\bphone\b/i,
      /\baddress\b/i
    ]

    personal_patterns.any? { |pattern| message.match?(pattern) }
  end

  def contains_time_references?(message)
    time_patterns = [
      /\b(today|now|current|latest|recent)\b/i,
      /\b(yesterday|tomorrow)\b/i,
      /\b\d{4}-\d{2}-\d{2}\b/,
      /\bthis (week|month|year)\b/i
    ]

    time_patterns.any? { |pattern| message.match?(pattern) }
  end

  def contains_current_info_request?(message)
    current_patterns = [
      /\b(current|latest|now|today)\b/i,
      /\bwhat.*time\b/i,
      /\bweather\b/i,
      /\bstock price\b/i
    ]

    current_patterns.any? { |pattern| message.match?(pattern) }
  end

  def factual_query?(message)
    factual_patterns = [
      /\bwhat is\b/i,
      /\bdefine\b/i,
      /\bexplain\b/i,
      /\bhow to\b/i,
      /\bwho (is|was)\b/i,
      /\bwhen (was|did)\b/i
    ]

    factual_patterns.any? { |pattern| message.match?(pattern) }
  end
end
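
Wrapping an agent is then a one-line change at the call site. A sketch with an illustrative agent definition:

# Example usage: wrap an existing agent with smart caching
agent = RAAF::Agent.new(
  name: "FAQBot",
  instructions: "Answer product questions concisely.",
  model: "gpt-4o-mini"
)

service = CachedAgentService.new(agent: agent, cache_strategy: :smart)

service.run("What is your return policy?")
service.run("WHAT IS YOUR RETURN POLICY?")  # differs only in case, so it
                                            # normalizes to the same cache key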

4. Provider Routing and Load Balancing

4.1. Multi-Provider Router

# lib/raaf/performance/provider_router.rb
module RAAF
  module Performance
    class ProviderRouter
      def initialize
        @providers = {}
        @health_status = {}
        @performance_metrics = {}
        @routing_strategy = :least_latency
        @circuit_breakers = {}
      end

      def register_provider(name, provider, weight: 1, priority: 1)
        @providers[name] = {
          provider: provider,
          weight: weight,
          priority: priority
        }

        @health_status[name] = true
        @performance_metrics[name] = {
          avg_latency: 0,
          success_rate: 100,
          requests_count: 0
        }

        @circuit_breakers[name] = CircuitBreaker.new(
          failure_threshold: 5,
          timeout: 60
        )
      end

      def route_request(request_context = {})
        available_providers = select_available_providers(request_context)

        raise RAAF::Errors::NoAvailableProviderError if available_providers.empty?

        selected_provider = case @routing_strategy
                           when :round_robin then round_robin_selection(available_providers)
                           when :least_latency then least_latency_selection(available_providers)
                           when :weighted then weighted_selection(available_providers)
                           when :priority then priority_selection(available_providers)
                           else random_selection(available_providers)
                           end

        selected_provider
      end

      def execute_with_fallback(request, max_attempts: 3)
        attempts = 0
        last_error = nil

        while attempts < max_attempts
          attempts += 1

          begin
            provider_name = route_request(request[:context])
            provider = @providers[provider_name][:provider]

            result = @circuit_breakers[provider_name].call do
              start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

              response = provider.chat_completion(request)

              latency = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
              record_success(provider_name, latency)

              response
            end

            return result

          rescue RAAF::Errors::RateLimitError => e
            record_rate_limit(provider_name)
            last_error = e
            sleep(calculate_backoff_delay(attempts))

          rescue RAAF::Errors::ProviderError => e
            record_failure(provider_name)
            last_error = e

            # Mark provider as temporarily unhealthy
            @health_status[provider_name] = false
            schedule_health_check(provider_name)
          end
        end

        raise last_error || RAAF::Errors::MaxRetriesExceededError.new("All retry attempts failed")
      end

      private

      def select_available_providers(request_context)
        @providers.select do |name, config|
          @health_status[name] && 
          @circuit_breakers[name].closed? &&
          meets_requirements?(config[:provider], request_context)
        end
      end

      def meets_requirements?(provider, request_context)
        # Check if provider supports required features
        required_model = request_context[:model]
        return true unless required_model

        provider.supported_models.include?(required_model)
      end

      def least_latency_selection(providers)
        providers.min_by do |name, _|
          @performance_metrics[name][:avg_latency]
        end.first
      end

      def weighted_selection(providers)
        total_weight = providers.sum { |_, config| config[:weight] }
        random_value = rand(total_weight)

        cumulative_weight = 0
        providers.each do |name, config|
          cumulative_weight += config[:weight]
          return name if random_value < cumulative_weight
        end

        providers.keys.first
      end

      def priority_selection(providers)
        providers.max_by { |_, config| config[:priority] }.first
      end

      def round_robin_selection(providers)
        # Cycle through providers in registration order
        @round_robin_index = ((@round_robin_index || -1) + 1) % providers.size
        providers.keys[@round_robin_index]
      end

      def random_selection(providers)
        providers.keys.sample
      end

      def record_success(provider_name, latency)
        metrics = @performance_metrics[provider_name]
        metrics[:requests_count] += 1

        # Update rolling average
        current_avg = metrics[:avg_latency]
        count = metrics[:requests_count]
        metrics[:avg_latency] = ((current_avg * (count - 1)) + latency) / count

        # Reset health status
        @health_status[provider_name] = true

        PerformanceTracker.instance.record_metric(
          'provider_latency',
          latency,
          provider: provider_name
        )
      end

      def record_failure(provider_name)
        metrics = @performance_metrics[provider_name]
        total_requests = metrics[:requests_count] + 1
        successful_requests = (metrics[:success_rate] / 100.0 * metrics[:requests_count]).round

        metrics[:success_rate] = (successful_requests / total_requests.to_f * 100).round(2)
        metrics[:requests_count] = total_requests

        PerformanceTracker.instance.record_metric(
          'provider_failure',
          1,
          provider: provider_name
        )
      end

      def record_rate_limit(provider_name)
        PerformanceTracker.instance.record_metric(
          'provider_rate_limit',
          1,
          provider: provider_name
        )
      end

      def schedule_health_check(provider_name)
        Thread.new do
          sleep(30)  # Wait before health check

          begin
            provider = @providers[provider_name][:provider]
            # Simple health check - try to list models
            provider.list_models
            @health_status[provider_name] = true
          rescue
            # Keep unhealthy status
          end
        end
      end

      def calculate_backoff_delay(attempt)
        # Exponential backoff with jitter
        base_delay = 2 ** attempt
        jitter = rand(0.5..1.5)
        [base_delay * jitter, 60].min  # Cap at 60 seconds
      end
    end
  end
end
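
The router above assumes a CircuitBreaker object that responds to closed? and call. RAAF may ship its own implementation; the following is only a minimal thread-safe sketch of the pattern:

# Minimal circuit breaker sketch (only #closed? and #call are required above)
class CircuitBreaker
  def initialize(failure_threshold: 5, timeout: 60)
    @failure_threshold = failure_threshold
    @timeout = timeout        # seconds to stay open before allowing a retry
    @failures = 0
    @opened_at = nil
    @mutex = Mutex.new
  end

  def closed?
    @mutex.synchronize do
      return true if @opened_at.nil?

      # Half-open: permit a trial call once the timeout has elapsed
      if Time.now - @opened_at >= @timeout
        @opened_at = nil
        @failures = 0
        true
      else
        false
      end
    end
  end

  def call
    raise "circuit open" unless closed?

    begin
      result = yield
      @mutex.synchronize { @failures = 0 }
      result
    rescue StandardError
      @mutex.synchronize do
        @failures += 1
        @opened_at = Time.now if @failures >= @failure_threshold
      end
      raise
    end
  end
end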

5. Token Management and Cost Optimization

5.1. Token Usage Optimizer

# lib/raaf/performance/token_optimizer.rb
require 'singleton'

module RAAF
  module Performance
    class TokenOptimizer
      include Singleton

      def initialize
        @token_costs = load_token_costs
        @usage_tracker = {}
      end

      def optimize_request(agent, message, context = {})
        # Estimate token usage
        estimated_tokens = estimate_tokens(agent, message, context)

        # Select optimal model based on complexity and cost
        optimal_model = select_optimal_model(estimated_tokens, context[:budget])

        # Optimize prompt if needed
        optimized_instructions = optimize_instructions(agent.instructions, optimal_model)

        # Create optimized agent
        optimized_agent = agent.dup
        optimized_agent.model = optimal_model
        optimized_agent.instructions = optimized_instructions

        optimized_agent
      end

      def estimate_tokens(agent, message, context)
        # Rough token estimation (1 token ≈ 4 characters for English)
        prompt_tokens = (agent.instructions.length + message.length) / 4

        # Add context tokens
        context_tokens = context.to_json.length / 4

        # Estimate completion tokens based on task complexity
        completion_tokens = estimate_completion_tokens(message, agent.model)

        {
          prompt_tokens: prompt_tokens + context_tokens,
          completion_tokens: completion_tokens,
          total_tokens: prompt_tokens + context_tokens + completion_tokens
        }
      end

      def calculate_cost(usage, model)
        return 0 unless @token_costs[model]

        input_cost = usage[:prompt_tokens] * @token_costs[model][:input] / 1_000_000
        output_cost = usage[:completion_tokens] * @token_costs[model][:output] / 1_000_000

        input_cost + output_cost
      end

      def track_usage(agent_name, model, usage, cost)
        key = "#{agent_name}_#{model}"
        @usage_tracker[key] ||= {
          total_tokens: 0,
          total_cost: 0,
          request_count: 0,
          avg_tokens_per_request: 0
        }

        tracker = @usage_tracker[key]
        tracker[:total_tokens] += usage[:total_tokens]
        tracker[:total_cost] += cost
        tracker[:request_count] += 1
        tracker[:avg_tokens_per_request] = tracker[:total_tokens] / tracker[:request_count]

        # Send metrics to monitoring
        PerformanceTracker.instance.record_metric('token_usage', usage[:total_tokens], 
          agent: agent_name, model: model)
        PerformanceTracker.instance.record_metric('cost', cost, 
          agent: agent_name, model: model)
      end

      def get_usage_report(time_range = 1.day.ago..Time.current)
        report = {}

        @usage_tracker.each do |key, data|
          agent_name, model = key.split('_', 2)

          report[agent_name] ||= {}
          report[agent_name][model] = {
            total_tokens: data[:total_tokens],
            total_cost: data[:total_cost].round(4),
            request_count: data[:request_count],
            avg_cost_per_request: (data[:total_cost] / data[:request_count]).round(4),
            avg_tokens_per_request: data[:avg_tokens_per_request].round(0)
          }
        end

        report
      end

      private

      def load_token_costs
        {
          'gpt-4o' => { input: 5.00, output: 15.00 },       # Per 1M tokens
          'gpt-4o-mini' => { input: 0.15, output: 0.60 },
          'gpt-4-turbo' => { input: 10.00, output: 30.00 },
          'gpt-3.5-turbo' => { input: 0.50, output: 1.50 },
          'claude-3-5-sonnet-20241022' => { input: 3.00, output: 15.00 },
          'claude-3-5-haiku-20241022' => { input: 0.25, output: 1.25 },
          'llama-3.1-70b-versatile' => { input: 0.59, output: 0.79 }
        }
      end

      def select_optimal_model(estimated_tokens, budget = nil)
        # Default to balanced option
        return 'gpt-4o-mini' unless budget

        viable_models = @token_costs.select do |model, costs|
          estimated_cost = calculate_cost(estimated_tokens, model)
          estimated_cost <= budget
        end

        # Select highest quality model within budget
        viable_models.max_by do |model, costs|
          model_quality_score(model)
        end&.first || 'gpt-4o-mini'
      end

      def model_quality_score(model)
        quality_scores = {
          'gpt-4o' => 100,
          'claude-3-5-sonnet-20241022' => 95,
          'gpt-4-turbo' => 90,
          'claude-3-5-haiku-20241022' => 80,
          'gpt-4o-mini' => 75,
          'llama-3.1-70b-versatile' => 70,
          'gpt-3.5-turbo' => 60
        }

        quality_scores[model] || 50
      end

      def optimize_instructions(instructions, model)
        # Simplify instructions for smaller models
        case model
        when 'gpt-4o-mini', 'gpt-3.5-turbo'
          # Keep instructions concise for smaller models
          instructions.split('.').first(3).join('.') + '.'
        else
          instructions
        end
      end

      def estimate_completion_tokens(message, model)
        # Estimate based on message complexity and model
        base_tokens = message.length / 4  # Start with message length

        # Adjust based on model capabilities
        multiplier = case model
                    when 'gpt-4o', 'claude-3-5-sonnet-20241022'
                      1.5  # More detailed responses
                    when 'gpt-4o-mini', 'claude-3-5-haiku-20241022'
                      1.0  # Concise responses
                    else
                      1.2  # Default
                    end

        # Adjust based on query type
        if message.include?('explain') || message.include?('describe')
          multiplier *= 1.5
        elsif message.include?('list') || message.include?('summary')
          multiplier *= 0.8
        end

        (base_tokens * multiplier).round
      end
    end
  end
end
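
Putting the optimizer in front of the runner then looks like this. A sketch; agent is assumed to exist and the shape of result.usage is an assumption:

# Example usage: optimize, run, then record actual usage and cost
optimizer = RAAF::Performance::TokenOptimizer.instance
message = "Summarize this support ticket in two sentences."

optimized_agent = optimizer.optimize_request(agent, message, budget: 0.005)
result = RAAF::Runner.new(agent: optimized_agent).run(message)

usage = result.usage  # assumed: { prompt_tokens:, completion_tokens:, total_tokens: }
cost = optimizer.calculate_cost(usage, optimized_agent.model)
optimizer.track_usage(agent.name, optimized_agent.model, usage, cost)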

6. Memory Management and Resource Optimization

6.1. Memory-Efficient Agent Management

# lib/raaf/performance/agent_pool.rb
require 'connection_pool'
require 'singleton'
module RAAF
  module Performance
    class AgentPool
      include Singleton

      def initialize
        @pools = {}
        @mutex = Mutex.new
        @cleanup_thread = start_cleanup_thread
      end

      def get_agent(agent_type, &block)
        pool = get_or_create_pool(agent_type)

        pool.with do |agent|
          # Reset agent state before use
          reset_agent_state(agent)
          yield agent
        end
      end

      def warm_up(agent_type, pool_size = 5)
        get_or_create_pool(agent_type, pool_size)
      end

      def shutdown
        @cleanup_thread&.kill

        @pools.each_value do |pool|
          # connection_pool's shutdown yields each pooled object
          pool.shutdown { |agent| agent.cleanup if agent.respond_to?(:cleanup) } if pool.respond_to?(:shutdown)
        end
      end

      private

      def get_or_create_pool(agent_type, size = 3)
        @mutex.synchronize do
          # ::ConnectionPool is the connection_pool gem's class, not the
          # RAAF::Performance wrapper defined earlier in this guide
          @pools[agent_type] ||= ::ConnectionPool.new(size: size, timeout: 30) do
            create_agent(agent_type)
          end
        end
      end

      def create_agent(agent_type)
        config = load_agent_config(agent_type)

        agent = RAAF::Agent.new(
          name: config[:name],
          instructions: config[:instructions],
          model: config[:model]
        )

        # Add tools if specified
        config[:tools]&.each do |tool_config|
          agent.add_tool(load_tool(tool_config))
        end

        agent
      end

      def reset_agent_state(agent)
        # Clear any stateful information
        agent.instance_variable_set(:@conversation_history, nil) if agent.instance_variable_defined?(:@conversation_history)
        agent.instance_variable_set(:@context_variables, {}) if agent.instance_variable_defined?(:@context_variables)
      end

      def start_cleanup_thread
        Thread.new do
          loop do
            sleep(300)  # Run every 5 minutes
            cleanup_unused_pools
          end
        end
      end

      def cleanup_unused_pools
        @mutex.synchronize do
          # Iterate over a snapshot of the keys so deletion is safe
          @pools.keys.each do |agent_type|
            pool = @pools[agent_type]

            # connection_pool does not track usage timestamps; @last_used
            # must be set by instrumentation around checkout, otherwise the
            # fallback below keeps every pool alive indefinitely
            last_used = pool.instance_variable_get(:@last_used) || Time.current

            if last_used < 30.minutes.ago
              pool.shutdown { |agent| agent.cleanup if agent.respond_to?(:cleanup) } if pool.respond_to?(:shutdown)
              @pools.delete(agent_type)
              Rails.logger.info "Cleaned up unused agent pool: #{agent_type}"
            end
          end
        end
      end

      def load_agent_config(agent_type)
        # Load from configuration file or database
        configs = {
          customer_support: {
            name: "CustomerSupport",
            instructions: "You are a helpful customer support agent.",
            model: "gpt-4o-mini",
            tools: [:knowledge_base_search]
          },
          data_analyst: {
            name: "DataAnalyst", 
            instructions: "You analyze data and provide insights.",
            model: "gpt-4o",
            tools: [:sql_query, :chart_generation]
          }
        }

        configs[agent_type] || raise("Unknown agent type: #{agent_type}")
      end

      def load_tool(tool_config)
        # The tool factory methods referenced below are application-specific
        # and not shown in this guide
        case tool_config
        when :knowledge_base_search
          method(:knowledge_base_search_tool)
        when :sql_query
          method(:sql_query_tool)
        when :chart_generation
          method(:chart_generation_tool)
        else
          raise("Unknown tool: #{tool_config}")
        end
      end
    end
  end
end
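
Borrowing an agent is scoped to a block, so instances return to the pool automatically. A usage sketch:

# Example usage: warm up at boot, then borrow agents per request
pool = RAAF::Performance::AgentPool.instance
pool.warm_up(:customer_support, 10)

pool.get_agent(:customer_support) do |agent|
  RAAF::Runner.new(agent: agent).run("Where is my order?")
end
# The agent is checked back in when the block exits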

6.2. Garbage Collection Optimization

# lib/raaf/performance/gc_optimizer.rb
require 'singleton'
module RAAF
  module Performance
    class GCOptimizer
      include Singleton

      def initialize
        @gc_stats = {}
        @optimization_enabled = true
        setup_gc_monitoring
      end

      def optimize_for_workload(workload_type)
        case workload_type
        when :high_throughput
          optimize_for_throughput
        when :low_latency
          optimize_for_latency
        when :memory_constrained
          optimize_for_memory
        else
          use_balanced_settings
        end
      end

      def enable_gc_compaction
        return unless GC.respond_to?(:compact)

        # Schedule periodic compaction
        Thread.new do
          loop do
            sleep(600)  # Every 10 minutes
            perform_compaction if should_compact?
          end
        end
      end

      def monitor_gc_performance
        # GC.stat[:time] (cumulative GC milliseconds) requires Ruby 3.1+;
        # guard with a default so older rubies report zero
        before_stats = GC.stat
        yield
        after_stats = GC.stat

        gc_time = (after_stats[:time] || 0) - (before_stats[:time] || 0)
        gc_count = after_stats[:count] - before_stats[:count]

        record_gc_metrics(gc_time, gc_count)
      end

      private

      def setup_gc_monitoring
        return unless @optimization_enabled

        # CRuby does not expose GC events through TracePoint, so sample
        # GC::Profiler's accumulated time periodically instead
        GC::Profiler.enable

        Thread.new do
          loop do
            sleep(60)
            elapsed = GC::Profiler.total_time
            record_gc_event(elapsed) if elapsed > 0
            GC::Profiler.clear
          end
        end
      end

      # RUBY_GC_* settings are environment variables that CRuby reads once
      # at boot; there is no runtime API to change them. These methods log
      # the recommended values so they can be exported before the next boot.
      def optimize_for_throughput
        # Reduce GC frequency, allow larger heap
        recommend_gc_settings(
          'RUBY_GC_HEAP_GROWTH_FACTOR' => 1.8,
          'RUBY_GC_HEAP_INIT_SLOTS' => 10_000,
          'RUBY_GC_HEAP_FREE_SLOTS' => 4_000,
          'RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR' => 2.0
        )
      end

      def optimize_for_latency
        # More frequent but shorter GC cycles
        recommend_gc_settings(
          'RUBY_GC_HEAP_GROWTH_FACTOR' => 1.1,
          'RUBY_GC_HEAP_INIT_SLOTS' => 5_000,
          'RUBY_GC_HEAP_FREE_SLOTS' => 1_000,
          'RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR' => 0.9
        )
      end

      def optimize_for_memory
        # Aggressive GC to minimize memory usage
        recommend_gc_settings(
          'RUBY_GC_HEAP_GROWTH_FACTOR' => 1.05,
          'RUBY_GC_HEAP_INIT_SLOTS' => 1_000,
          'RUBY_GC_HEAP_FREE_SLOTS' => 500,
          'RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR' => 0.5
        )
      end

      def use_balanced_settings
        # Ruby defaults with slight optimizations
        recommend_gc_settings(
          'RUBY_GC_HEAP_GROWTH_FACTOR' => 1.2,
          'RUBY_GC_HEAP_INIT_SLOTS' => 5_000,
          'RUBY_GC_HEAP_FREE_SLOTS' => 2_000,
          'RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR' => 1.2
        )
      end

      def recommend_gc_settings(settings)
        settings.each do |key, value|
          Rails.logger.info "Recommended GC setting (export before boot): #{key}=#{value}"
        end
      end

      def should_compact?
        return false unless GC.respond_to?(:compact)

        stat = GC.stat
        heap_slots = stat[:heap_live_slots] + stat[:heap_free_slots]
        fragmentation_ratio = stat[:heap_free_slots].to_f / heap_slots

        # Compact if fragmentation > 30%
        fragmentation_ratio > 0.3
      end

      def perform_compaction
        Rails.logger.info "Performing GC compaction..."
        start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

        GC.compact

        duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
        Rails.logger.info "GC compaction completed in #{duration.round(3)}s"

        PerformanceTracker.instance.record_metric('gc_compaction_duration', duration)
      end

      def record_gc_event(duration)
        @gc_stats[:total_time] = (@gc_stats[:total_time] || 0) + duration
        @gc_stats[:count] = (@gc_stats[:count] || 0) + 1
        @gc_stats[:avg_duration] = @gc_stats[:total_time] / @gc_stats[:count]

        PerformanceTracker.instance.record_metric('gc_duration', duration)
      end

      def record_gc_metrics(gc_time, gc_count)
        PerformanceTracker.instance.record_metric('gc_time_total', gc_time) if gc_time > 0
        PerformanceTracker.instance.record_metric('gc_count', gc_count) if gc_count > 0
      end
    end
  end
end

7. Performance Monitoring and Profiling

7.1. Performance Profiler

# lib/raaf/performance/profiler.rb
require 'securerandom'
require 'singleton'
module RAAF
  module Performance
    class Profiler
      include Singleton

      def initialize
        @profiles = {}
        @active_profiles = {}
      end

      def profile(operation_name, &block)
        profile_id = start_profiling(operation_name)

        begin
          result = yield

          end_profiling(profile_id, success: true)
          result
        rescue => e
          end_profiling(profile_id, success: false, error: e)
          raise
        end
      end

      def memory_profile(operation_name, &block)
        require 'memory_profiler'

        report = MemoryProfiler.report do
          yield
        end

        analysis = analyze_memory_report(report)
        store_memory_profile(operation_name, analysis)

        analysis
      end

      def cpu_profile(operation_name, duration: 30, &block)
        require 'ruby-prof'

        result = nil

        # ruby-prof 1.x API: profile a block and receive a Profile back
        profile = RubyProf::Profile.profile do
          if block_given?
            result = yield
          else
            sleep(duration)
          end
        end

        analysis = analyze_cpu_profile(profile)
        store_cpu_profile(operation_name, analysis)

        block_given? ? result : analysis
      end

      def benchmark(operation_name, iterations: 100, &block)
        require 'benchmark'

        times = []

        iterations.times do
          time = Benchmark.measure(&block)
          times << time.real
        end

        analysis = {
          operation: operation_name,
          iterations: iterations,
          avg_time: times.sum / times.size,
          min_time: times.min,
          max_time: times.max,
          std_deviation: calculate_std_deviation(times),
          percentiles: calculate_percentiles(times)
        }

        store_benchmark(operation_name, analysis)
        analysis
      end

      def get_performance_summary(operation_name = nil)
        if operation_name
          @profiles[operation_name] || {}
        else
          @profiles
        end
      end

      private

      def start_profiling(operation_name)
        profile_id = SecureRandom.uuid

        @active_profiles[profile_id] = {
          operation: operation_name,
          start_time: Process.clock_gettime(Process::CLOCK_MONOTONIC),
          start_memory: get_memory_usage,
          start_gc_stat: GC.stat
        }

        profile_id
      end

      def end_profiling(profile_id, success:, error: nil)
        profile = @active_profiles.delete(profile_id)
        return unless profile

        end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        end_memory = get_memory_usage
        end_gc_stat = GC.stat

        duration = end_time - profile[:start_time]
        memory_delta = end_memory - profile[:start_memory]
        gc_delta = calculate_gc_delta(profile[:start_gc_stat], end_gc_stat)

        profile_data = {
          operation: profile[:operation],
          duration: duration,
          memory_delta: memory_delta,
          gc_time: gc_delta[:time],
          gc_count: gc_delta[:count],
          success: success,
          error: error&.class&.name,
          timestamp: Time.current
        }

        store_profile(profile[:operation], profile_data)

        # Send to monitoring
        PerformanceTracker.instance.record_metric(
          'operation_duration',
          duration,
          operation: profile[:operation],
          success: success
        )
      end

      def get_memory_usage
        # Get current memory usage in MB
        `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
      rescue
        0
      end

      def calculate_gc_delta(start_stat, end_stat)
        # GC.stat[:time] requires Ruby 3.1+; default to zero on older rubies
        {
          time: (end_stat[:time] || 0) - (start_stat[:time] || 0),
          count: end_stat[:count] - start_stat[:count]
        }
      end

      def analyze_memory_report(report)
        {
          total_allocated: report.total_allocated,
          total_retained: report.total_retained,
          allocated_objects: report.allocated_objects_by_class.first(10),
          retained_objects: report.retained_objects_by_class.first(10),
          allocated_memory_by_file: report.allocated_memory_by_file.first(10),
          allocated_memory_by_location: report.allocated_memory_by_location.first(10)
        }
      end

      def analyze_cpu_profile(profile_data)
        # ruby-prof groups results per thread; flatten the method tables
        methods = profile_data.threads.flat_map(&:methods)

        {
          total_time: profile_data.threads.sum(&:total_time),
          methods: methods.sort_by(&:total_time).reverse.first(20).map do |method|
            {
              name: method.full_name,
              total_time: method.total_time,
              self_time: method.self_time,
              calls: method.called
            }
          end
        }
      end

      def calculate_std_deviation(values)
        mean = values.sum / values.size.to_f
        variance = values.sum { |v| (v - mean) ** 2 } / values.size.to_f
        Math.sqrt(variance)
      end

      def calculate_percentiles(values)
        sorted = values.sort
        {
          p50: percentile(sorted, 50),
          p90: percentile(sorted, 90),
          p95: percentile(sorted, 95),
          p99: percentile(sorted, 99)
        }
      end

      def percentile(sorted_values, percent)
        index = (percent / 100.0 * sorted_values.length).ceil - 1
        sorted_values[index]
      end

      def store_profile(operation_name, profile_data)
        @profiles[operation_name] ||= []
        @profiles[operation_name] << profile_data

        # Keep only recent profiles (last 1000)
        @profiles[operation_name] = @profiles[operation_name].last(1000)
      end

      def store_memory_profile(operation_name, analysis)
        @profiles["#{operation_name}_memory"] = analysis
      end

      def store_cpu_profile(operation_name, analysis)
        @profiles["#{operation_name}_cpu"] = analysis
      end

      def store_benchmark(operation_name, analysis)
        @profiles["#{operation_name}_benchmark"] = analysis
      end
    end
  end
end
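
Wrapping an operation is then a single call; memory_profile and benchmark follow the same pattern. A sketch, assuming agent is defined and the memory_profiler and ruby-prof gems are in the bundle:

# Example usage: profile a single agent run, then benchmark a hot path
profiler = RAAF::Performance::Profiler.instance

result = profiler.profile("agent_run") do
  RAAF::Runner.new(agent: agent).run("Summarize today's tickets")
end

stats = profiler.benchmark("message_normalization", iterations: 1000) do
  "What's the weather TODAY?".downcase.strip
end
puts stats[:percentiles]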

8. Production Performance Configuration

8.1. Environment-specific Optimizations

# config/initializers/raaf_performance.rb
if Rails.env.production?
  # Enable all performance optimizations
  RAAF::Performance::GCOptimizer.instance.optimize_for_workload(:high_throughput)
  RAAF::Performance::GCOptimizer.instance.enable_gc_compaction

  # Warm up agent pools
  RAAF::Performance::AgentPool.instance.warm_up(:customer_support, 10)
  RAAF::Performance::AgentPool.instance.warm_up(:data_analyst, 5)

  # Configure provider routing
  router = RAAF::Performance::ProviderRouter.new
  router.register_provider(:openai_primary, 
    RAAF::Models::ResponsesProvider.new, 
    weight: 3, priority: 1)
  router.register_provider(:anthropic_backup, 
    RAAF::Models::AnthropicProvider.new, 
    weight: 1, priority: 2)

  # Enable response caching
  RAAF.configure do |config|
    config.response_cache_enabled = true
    config.cache_ttl = 1.hour
    config.cache_size = 500.megabytes
  end

  # Token optimization
  RAAF.configure do |config|
    config.token_optimizer_enabled = true
    config.cost_budget_per_request = 0.01  # $0.01 per request
    config.prefer_fast_models_for_simple_queries = true
  end
end

if Rails.env.development?
  # Enable profiling in development
  RAAF.configure do |config|
    config.profiling_enabled = true
    config.detailed_logging = true
  end
end

8.2. Monitoring Dashboard Integration

# app/controllers/admin/performance_controller.rb
class Admin::PerformanceController < ApplicationController
  before_action :authenticate_admin!

  def index
    @performance_summary = build_performance_summary
    @recent_profiles = get_recent_profiles
    @cache_stats = get_cache_statistics
    @provider_health = get_provider_health
  end

  def detailed_report
    operation = params[:operation]
    time_range = parse_time_range(params[:time_range])

    @report = {
      operation: operation,
      time_range: time_range,
      performance_stats: PerformanceTracker.instance.get_stats(operation, time_range),
      recent_profiles: RAAF::Performance::Profiler.instance.get_performance_summary(operation),
      cost_analysis: RAAF::Performance::TokenOptimizer.instance.get_usage_report(time_range)
    }

    render json: @report
  end

  private

  def build_performance_summary
    {
      avg_response_time: PerformanceTracker.instance.get_stats('agent_response_time')[:avg],
      cache_hit_rate: RAAF::Performance::ResponseCache.instance.hit_rate,
      total_requests_today: count_requests_today,
      error_rate: calculate_error_rate,
      cost_today: calculate_cost_today
    }
  end

  def get_recent_profiles
    RAAF::Performance::Profiler.instance.get_performance_summary
              .values
              .select { |v| v.is_a?(Array) }  # skip memory/cpu/benchmark reports
              .flatten
              .select { |p| p[:timestamp] && p[:timestamp] > 1.hour.ago }
              .sort_by { |p| p[:timestamp] }
              .reverse
              .first(20)
  end

  def get_cache_statistics
    cache = RAAF::Performance::ResponseCache.instance

    {
      hit_rate: cache.hit_rate
      # Store size and eviction counts depend on the backing cache store
      # and are not exposed by the ResponseCache wrapper above
    }
  end

  def get_provider_health
    # Return health status of all configured providers
    {
      openai: check_provider_health(:openai),
      anthropic: check_provider_health(:anthropic),
      groq: check_provider_health(:groq)
    }
  end

  def check_provider_health(provider_name)
    # Placeholder values; in production, read these from ProviderRouter metrics
    {
      status: 'healthy',
      avg_latency: rand(100..500),
      success_rate: 98.5,
      last_check: Time.current
    }
  end
end

9. Best Practices Summary

9.1. Performance Optimization Checklist

  1. Connection Management

    • ✅ Use connection pooling for AI providers
    • ✅ Enable HTTP keep-alive
    • ✅ Implement request compression
    • ✅ Configure appropriate timeouts
  2. Caching Strategy

    • ✅ Cache deterministic agent responses
    • ✅ Implement intelligent cache invalidation
    • ✅ Use appropriate TTL values
    • ✅ Monitor cache hit rates
  3. Resource Management

    • ✅ Pool and reuse agent instances
    • ✅ Optimize garbage collection settings
    • ✅ Monitor memory usage
    • ✅ Implement resource cleanup
  4. Cost Optimization

    • ✅ Choose appropriate models for tasks
    • ✅ Optimize prompt lengths
    • ✅ Track token usage and costs
    • ✅ Implement budget controls
  5. Provider Management

    • ✅ Implement multi-provider routing
    • ✅ Use circuit breakers for reliability
    • ✅ Monitor provider health
    • ✅ Implement graceful fallbacks
  6. Monitoring and Profiling

    • ✅ Track key performance metrics
    • ✅ Profile critical operations
    • ✅ Set up alerting for performance issues
    • ✅ Regular performance reviews

10. Next Steps

For more advanced topics:


