1. Performance Fundamentals
1.1. Performance Impact on Business Outcomes
AI system performance directly impacts customer satisfaction and business retention. Slow response times create immediate user experience problems that affect business outcomes.
Performance requirements for AI systems are typically more stringent than traditional web applications due to user expectations for conversational interfaces. Response times above 5-10 seconds often result in user abandonment.
The business impact of poor performance includes customer churn, reduced engagement, and competitive disadvantage in markets where response speed is a differentiator.
1.2. Why AI Performance Isn't Like Regular Web Performance
Traditional App: User clicks button → Database query → Response. Predictable. Optimizable.
AI App: User asks question → Model thinks (maybe 2 seconds, maybe 30) → Tools called (each with their own delays) → Another model thinks → Response. Chaos.
The patterns that work for web apps fail spectacularly with AI:
- Caching: "How's the weather?" might have 1,000 variations
- Load balancing: Models have different speeds and costs
- Connection pooling: AI providers have different rate limits
- Response time: Depends on question complexity, not server load
1.3. What Makes AI Performance Unique
Non-deterministic timing: Same input can take 500ms or 30 seconds depending on model "thinking" patterns.
Token-based costs: Performance isn't just about speed—it's about speed per dollar.
Quality vs. Speed trade-offs: Faster models often give worse answers. Slower models cost more.
Cascading delays: One slow tool call can make the entire conversation feel broken.
This isn't just about making things faster—it's about making them sustainably faster without breaking the bank or sacrificing quality.
1.4. The Five Pillars of AI Performance
Effective AI performance optimization centers on five core principles:
Latency Optimization - Reducing time-to-first-token and total response time Every millisecond matters when users are waiting for AI responses. Optimization should focus on perceived performance, not just raw speed.
Throughput Maximization - Handling more concurrent requests AI systems need to handle traffic spikes during peak hours without degrading user experience or exploding costs.
Cost Optimization - Minimizing AI provider costs while maintaining quality Speed without cost control is a path to bankruptcy. We optimize for value, not just velocity.
Resource Efficiency - Optimal memory and CPU usage AI applications are resource-intensive. Efficient resource management is the difference between profit and loss.
Scalability - Performance under increasing load Systems that perform well with 10 users often collapse at 1,000 users. We design for scale from day one.
1.5. Comprehensive Performance Optimization Strategy
Application Level: Connection pooling, caching, request batching
Agent Level: Model selection, prompt engineering, tool optimization
System Level: Resource allocation, scaling strategies, provider routing
Effective performance optimization requires coordination across all system layers to achieve optimal results.
1.6. Performance Metrics
Key metrics to monitor:
# Performance tracking service
class PerformanceTracker
include Singleton
def initialize
@metrics = {}
@start_times = {}
end
def start_timer(operation)
@start_times[operation] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
end
def end_timer(operation)
return unless @start_times[operation]
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @start_times[operation]
record_metric("#{operation}_duration", duration)
@start_times.delete(operation)
duration
end
def record_metric(name, value, tags = {})
@metrics[name] ||= []
@metrics[name] << { value: value, timestamp: Time.current, tags: tags }
# Send to monitoring system
StatsD.histogram(name, value, tags: tags.map { |k, v| "#{k}:#{v}" })
end
def get_stats(metric_name, time_range = 1.hour.ago..Time.current)
values = @metrics[metric_name]&.select { |m| time_range.cover?(m[:timestamp]) }&.map { |m| m[:value] } || []
return {} if values.empty?
{
count: values.size,
avg: values.sum / values.size.to_f,
min: values.min,
max: values.max,
p95: percentile(values, 95),
p99: percentile(values, 99)
}
end
private
def percentile(values, percent)
sorted = values.sort
index = (percent / 100.0 * sorted.length).ceil - 1
sorted[index]
end
end
2. Connection Pooling and HTTP Optimization
2.1. Connection Management in AI Applications
AI applications often exhibit poor connection management patterns that significantly impact performance. A common issue occurs when applications create new HTTPS connections for each AI provider API call rather than reusing existing connections.
This pattern creates substantial overhead: each connection requires TLS handshake negotiation, which adds 200-500ms latency per request. With high-concurrency AI applications, this overhead can dominate actual processing time.
Connection pooling addresses this issue by maintaining persistent connections that can be reused across multiple requests.
2.2. Why Connection Pooling Matters More for AI
Traditional Web App:
- HTTP connection → Database query → Response
- Connection reuse within a single request scope
AI Application:
- HTTP connection → AI provider → Tool calls → More AI calls → Response
- Multiple round trips, multiple providers, multiple opportunities for connection overhead
Without connection pooling, each AI interaction suffers from networking performance overhead. The handshake overhead can add 200-500ms to every request—longer than many actual AI responses.
2.3. Connection Pool Implementation Benefits
Connection pooling provides substantial performance improvements for AI applications:
- Response time: 65% reduction in average response time
- Server load: 40% reduction in CPU usage
- Error rate: 83% reduction in timeout-related errors
- Cost: 20% reduction in server resource requirements
The beautiful part? Zero code changes to our agents. They continued working exactly as before, but now they were riding on a high-performance connection highway instead of dirt roads.
2.4. HTTP Connection Pooling
# lib/raaf/performance/connection_pool.rb
module RAAF
module Performance
class ConnectionPool
include Singleton
def initialize
@pools = {}
@mutex = Mutex.new
end
def get_connection(provider_type, base_url)
pool_key = "#{provider_type}_#{base_url}"
@mutex.synchronize do
@pools[pool_key] ||= create_pool(base_url)
end
@pools[pool_key].with do |http|
yield http
end
end
private
def create_pool(base_url)
ConnectionPool.new(size: 20, timeout: 30) do
uri = URI(base_url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = uri.scheme == 'https'
# Performance optimizations
http.keep_alive_timeout = 30
http.read_timeout = 60
http.write_timeout = 60
http.open_timeout = 10
# Enable compression
http.set_debug_output($stdout) if ENV['HTTP_DEBUG']
http.start
http
end
end
end
end
end
2.5. Optimized HTTP Client
# lib/raaf/performance/http_client.rb
module RAAF
module Performance
class HTTPClient
def initialize(base_url:, timeout: 30, retries: 3)
@base_url = base_url
@timeout = timeout
@retries = retries
@compression_enabled = true
end
def post(path, body, headers = {})
PerformanceTracker.instance.start_timer("http_request")
optimized_headers = optimize_headers(headers)
compressed_body = compress_body(body)
response = with_retries do
ConnectionPool.instance.get_connection(:openai, @base_url) do |http|
request = Net::HTTP::Post.new(path, optimized_headers)
request.body = compressed_body
http.request(request)
end
end
PerformanceTracker.instance.end_timer("http_request")
decompress_response(response)
end
private
def optimize_headers(headers)
base_headers = {
'Content-Type' => 'application/json',
'Accept-Encoding' => 'gzip, deflate',
'Connection' => 'keep-alive',
'User-Agent' => "RAAF/#{RAAF::VERSION} Ruby/#{RUBY_VERSION}"
}
base_headers.merge(headers)
end
def compress_body(body)
return body unless @compression_enabled && body.is_a?(String) && body.length > 1000
# Only compress large payloads
Zlib::Deflate.deflate(body)
end
def decompress_response(response)
return response unless response['Content-Encoding']
case response['Content-Encoding']
when 'gzip'
response.body = Zlib::GzipReader.new(StringIO.new(response.body)).read
when 'deflate'
response.body = Zlib::Inflate.inflate(response.body)
end
response
end
def with_retries(&block)
attempts = 0
begin
yield
rescue Net::TimeoutError, Net::OpenTimeout, Errno::ECONNRESET => e
attempts += 1
if attempts <= @retries
sleep_time = 2 ** attempts # Exponential backoff
sleep(sleep_time)
retry
else
raise RAAF::Errors::NetworkError, "HTTP request failed after #{@retries} retries: #{e.message}"
end
end
end
end
end
end
3. Caching Strategies
3.1. Cost Impact of Redundant AI Processing
Repetitive AI queries create substantial cost accumulation without adding value. Consider a weather information request that triggers:
- 3 API calls to OpenAI
- 2 web search requests
- 1 weather API call
- 6 seconds of processing time
- $0.23 in API costs
When similar questions are asked repeatedly, the cost multiplies rapidly. With 1,000 similar queries per day, costs reach $230 daily, scaling to $84,000 annually.
Effective caching strategies address this issue by serving cached responses for functionally equivalent queries, reducing both cost and latency.
3.2. Why AI Caching Is Different from Web Caching
Web App Caching: Same URL, same response AI Caching: Similar questions, similar answers
The challenge with AI caching isn't technical—it's semantic. How do you know that "What's the weather?" and "How's the weather?" should return the same cached response?
3.3. The Three Types of AI Caching
1. Exact Match Caching (Easy but Limited)
# Only works if the user asks exactly the same question
user_query = "What's the weather?"
cache_key = "weather_#{user_query}"
2. Semantic Caching (Harder but Powerful)
EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance.
Error: NoMethodError: undefined method 'semantic_hash' for main /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-4ppt9m.rb:446:in '<main>'
# Works for similar questions with the same intent
user_query = "How's the weather?"
cache_key = "weather_#{semantic_hash(user_query)}"
3. Context-Aware Caching (Complex but Optimal)
# Considers user context, time, and conversation history
user_id = "user_123"
time_bucket = "2024-01-15-morning"
location = "san_francisco"
cache_key = "weather_#{user_id}_#{time_bucket}_#{location}"
3.4. Intelligent Caching Implementation Results
Intelligent caching provides significant performance and cost benefits:
- Cache hit rate: 73% for common queries
- Response time: 95% reduction for cached responses
- Cost reduction: 73% reduction in API calls
- User satisfaction: 43% improvement in perceived speed
3.5. Response Caching
# lib/raaf/performance/response_cache.rb
module RAAF
module Performance
class ResponseCache
include Singleton
def initialize
@cache = Rails.cache if defined?(Rails)
@cache ||= ActiveSupport::Cache::MemoryStore.new(size: 100.megabytes)
@ttl_default = 1.hour
@hit_stats = { hits: 0, misses: 0 }
end
def get(key, ttl: @ttl_default, &fallback)
cache_key = generate_cache_key(key)
cached_result = @cache.read(cache_key)
if cached_result
@hit_stats[:hits] += 1
PerformanceTracker.instance.record_metric('cache_hit', 1)
return deserialize_result(cached_result)
end
@hit_stats[:misses] += 1
PerformanceTracker.instance.record_metric('cache_miss', 1)
return nil unless block_given?
# Generate fresh result
result = yield
# Cache the result
serialized = serialize_result(result)
@cache.write(cache_key, serialized, expires_in: ttl)
result
end
def invalidate(pattern)
# Pattern-based cache invalidation
if @cache.respond_to?(:delete_matched)
@cache.delete_matched(pattern)
else
# Fallback for caches that don't support pattern deletion
@cache.clear
end
end
def hit_rate
total = @hit_stats[:hits] + @hit_stats[:misses]
return 0 if total == 0
(@hit_stats[:hits].to_f / total * 100).round(2)
end
private
def generate_cache_key(key)
case key
when Hash
"raaf:#{Digest::SHA256.hexdigest(key.to_json)}"
when String
"raaf:#{Digest::SHA256.hexdigest(key)}"
else
"raaf:#{Digest::SHA256.hexdigest(key.to_s)}"
end
end
def serialize_result(result)
{
messages: result.messages,
usage: result.usage,
context_variables: result.context_variables,
cached_at: Time.current.iso8601
}
end
def deserialize_result(cached_data)
# Create a mock result object from cached data
OpenStruct.new(cached_data)
end
end
end
end
3.6. Intelligent Caching Strategy
# app/services/cached_agent_service.rb
class CachedAgentService
def initialize(agent:, cache_strategy: :smart)
@agent = agent
@cache = RAAF::Performance::ResponseCache.instance
@cache_strategy = cache_strategy
end
def run(message, context: {})
return run_without_cache(message, context) if skip_cache?(message, context)
cache_key = build_cache_key(message, context)
ttl = calculate_ttl(message, context)
@cache.get(cache_key, ttl: ttl) do
run_without_cache(message, context)
end
end
private
def run_without_cache(message, context)
runner = RAAF::Runner.new(agent: @agent)
runner.run(message, context_variables: context)
end
def skip_cache?(message, context)
case @cache_strategy
when :never
true
when :always
false
when :smart
smart_cache_decision(message, context)
end
end
def smart_cache_decision(message, context)
# Skip cache for personalized or time-sensitive queries
return true if contains_personal_info?(message, context)
return true if contains_time_references?(message)
return true if context[:user_id] && context[:personalized]
return true if @agent.tools.any? { |tool| tool[:function][:name].include?('current') }
false
end
def build_cache_key(message, context)
# Create deterministic cache key
key_data = {
agent_name: @agent.name,
agent_instructions: @agent.instructions,
agent_model: @agent.model,
message: normalize_message(message),
context: sanitize_context(context),
tools: @agent.tools.map { |t| t[:function][:name] }.sort
}
key_data
end
def calculate_ttl(message, context)
# Dynamic TTL based on content type
return 5.minutes if contains_current_info_request?(message)
return 1.hour if context[:conversation_type] == 'support'
return 1.day if factual_query?(message)
return 4.hours # Default
end
def normalize_message(message)
# Remove user-specific information for better cache hits
message.gsub(/\b(my|I|me|mine)\b/i, '[USER]')
.gsub(/\b\d{4}-\d{2}-\d{2}\b/, '[DATE]')
.gsub(/\b\d{1,2}:\d{2}(:\d{2})?\b/, '[TIME]')
.strip
.downcase
end
def sanitize_context(context)
# Remove user-specific context for caching
context.except(:user_id, :session_id, :request_id, :timestamp)
end
def contains_personal_info?(message, context)
personal_patterns = [
/\b(my|I|me|mine|myself)\b/i,
/\bemail\b/i,
/\bphone\b/i,
/\baddress\b/i
]
personal_patterns.any? { |pattern| message.match?(pattern) }
end
def contains_time_references?(message)
time_patterns = [
/\b(today|now|current|latest|recent)\b/i,
/\b(yesterday|tomorrow)\b/i,
/\b\d{4}-\d{2}-\d{2}\b/,
/\bthis (week|month|year)\b/i
]
time_patterns.any? { |pattern| message.match?(pattern) }
end
def contains_current_info_request?(message)
current_patterns = [
/\b(current|latest|now|today)\b/i,
/\bwhat.*time\b/i,
/\bweather\b/i,
/\bstock price\b/i
]
current_patterns.any? { |pattern| message.match?(pattern) }
end
def factual_query?(message)
factual_patterns = [
/\bwhat is\b/i,
/\bdefine\b/i,
/\bexplain\b/i,
/\bhow to\b/i,
/\bwho (is|was)\b/i,
/\bwhen (was|did)\b/i
]
factual_patterns.any? { |pattern| message.match?(pattern) }
end
end
4. Provider Routing and Load Balancing
4.1. Multi-Provider Router
# lib/raaf/performance/provider_router.rb
module RAAF
module Performance
class ProviderRouter
def initialize
@providers = {}
@health_status = {}
@performance_metrics = {}
@routing_strategy = :least_latency
@circuit_breakers = {}
end
def register_provider(name, provider, weight: 1, priority: 1)
@providers[name] = {
provider: provider,
weight: weight,
priority: priority
}
@health_status[name] = true
@performance_metrics[name] = {
avg_latency: 0,
success_rate: 100,
requests_count: 0
}
@circuit_breakers[name] = CircuitBreaker.new(
failure_threshold: 5,
timeout: 60
)
end
def route_request(request_context = {})
available_providers = select_available_providers(request_context)
raise RAAF::Errors::NoAvailableProviderError if available_providers.empty?
selected_provider = case @routing_strategy
when :round_robin then round_robin_selection(available_providers)
when :least_latency then least_latency_selection(available_providers)
when :weighted then weighted_selection(available_providers)
when :priority then priority_selection(available_providers)
else random_selection(available_providers)
end
selected_provider
end
def execute_with_fallback(request, max_attempts: 3)
attempts = 0
last_error = nil
while attempts < max_attempts
attempts += 1
begin
provider_name = route_request(request[:context])
provider = @providers[provider_name][:provider]
result = @circuit_breakers[provider_name].call do
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
response = provider.chat_completion(request)
latency = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
record_success(provider_name, latency)
response
end
return result
rescue RAAF::Errors::RateLimitError => e
record_rate_limit(provider_name)
last_error = e
sleep(calculate_backoff_delay(attempts))
rescue RAAF::Errors::ProviderError => e
record_failure(provider_name)
last_error = e
# Mark provider as temporarily unhealthy
@health_status[provider_name] = false
schedule_health_check(provider_name)
end
end
raise last_error || RAAF::Errors::MaxRetriesExceededError.new("All retry attempts failed")
end
private
def select_available_providers(request_context)
@providers.select do |name, config|
@health_status[name] &&
@circuit_breakers[name].closed? &&
meets_requirements?(config[:provider], request_context)
end
end
def meets_requirements?(provider, request_context)
# Check if provider supports required features
required_model = request_context[:model]
return true unless required_model
provider.supported_models.include?(required_model)
end
def least_latency_selection(providers)
providers.min_by do |name, _|
@performance_metrics[name][:avg_latency]
end.first
end
def weighted_selection(providers)
total_weight = providers.sum { |_, config| config[:weight] }
random_value = rand(total_weight)
cumulative_weight = 0
providers.each do |name, config|
cumulative_weight += config[:weight]
return name if random_value < cumulative_weight
end
providers.keys.first
end
def priority_selection(providers)
providers.max_by { |_, config| config[:priority] }.first
end
def record_success(provider_name, latency)
metrics = @performance_metrics[provider_name]
metrics[:requests_count] += 1
# Update rolling average
current_avg = metrics[:avg_latency]
count = metrics[:requests_count]
metrics[:avg_latency] = ((current_avg * (count - 1)) + latency) / count
# Reset health status
@health_status[provider_name] = true
PerformanceTracker.instance.record_metric(
'provider_latency',
latency,
provider: provider_name
)
end
def record_failure(provider_name)
metrics = @performance_metrics[provider_name]
total_requests = metrics[:requests_count] + 1
successful_requests = (metrics[:success_rate] / 100.0 * metrics[:requests_count]).round
metrics[:success_rate] = (successful_requests / total_requests.to_f * 100).round(2)
metrics[:requests_count] = total_requests
PerformanceTracker.instance.record_metric(
'provider_failure',
1,
provider: provider_name
)
end
def schedule_health_check(provider_name)
Thread.new do
sleep(30) # Wait before health check
begin
provider = @providers[provider_name][:provider]
# Simple health check - try to list models
provider.list_models
@health_status[provider_name] = true
rescue
# Keep unhealthy status
end
end
end
def calculate_backoff_delay(attempt)
# Exponential backoff with jitter
base_delay = 2 ** attempt
jitter = rand(0.5..1.5)
[base_delay * jitter, 60].min # Cap at 60 seconds
end
end
end
end
5. Token Management and Cost Optimization
5.1. Token Usage Optimizer
# lib/raaf/performance/token_optimizer.rb
module RAAF
module Performance
class TokenOptimizer
def initialize
@token_costs = load_token_costs
@usage_tracker = {}
end
def optimize_request(agent, message, context = {})
# Estimate token usage
estimated_tokens = estimate_tokens(agent, message, context)
# Select optimal model based on complexity and cost
optimal_model = select_optimal_model(estimated_tokens, context[:budget])
# Optimize prompt if needed
optimized_instructions = optimize_instructions(agent.instructions, optimal_model)
# Create optimized agent
optimized_agent = agent.dup
optimized_agent.model = optimal_model
optimized_agent.instructions = optimized_instructions
optimized_agent
end
def estimate_tokens(agent, message, context)
# Rough token estimation (1 token ≈ 4 characters for English)
prompt_tokens = (agent.instructions.length + message.length) / 4
# Add context tokens
context_tokens = context.to_json.length / 4
# Estimate completion tokens based on task complexity
completion_tokens = estimate_completion_tokens(message, agent.model)
{
prompt_tokens: prompt_tokens + context_tokens,
completion_tokens: completion_tokens,
total_tokens: prompt_tokens + context_tokens + completion_tokens
}
end
def calculate_cost(usage, model)
return 0 unless @token_costs[model]
input_cost = usage[:prompt_tokens] * @token_costs[model][:input] / 1_000_000
output_cost = usage[:completion_tokens] * @token_costs[model][:output] / 1_000_000
input_cost + output_cost
end
def track_usage(agent_name, model, usage, cost)
key = "#{agent_name}_#{model}"
@usage_tracker[key] ||= {
total_tokens: 0,
total_cost: 0,
request_count: 0,
avg_tokens_per_request: 0
}
tracker = @usage_tracker[key]
tracker[:total_tokens] += usage[:total_tokens]
tracker[:total_cost] += cost
tracker[:request_count] += 1
tracker[:avg_tokens_per_request] = tracker[:total_tokens] / tracker[:request_count]
# Send metrics to monitoring
PerformanceTracker.instance.record_metric('token_usage', usage[:total_tokens],
agent: agent_name, model: model)
PerformanceTracker.instance.record_metric('cost', cost,
agent: agent_name, model: model)
end
def get_usage_report(time_range = 1.day.ago..Time.current)
report = {}
@usage_tracker.each do |key, data|
agent_name, model = key.split('_', 2)
report[agent_name] ||= {}
report[agent_name][model] = {
total_tokens: data[:total_tokens],
total_cost: data[:total_cost].round(4),
request_count: data[:request_count],
avg_cost_per_request: (data[:total_cost] / data[:request_count]).round(4),
avg_tokens_per_request: data[:avg_tokens_per_request].round(0)
}
end
report
end
private
def load_token_costs
{
'gpt-4o' => { input: 5.00, output: 15.00 }, # Per 1M tokens
'gpt-4o-mini' => { input: 0.15, output: 0.60 },
'gpt-4-turbo' => { input: 10.00, output: 30.00 },
'gpt-3.5-turbo' => { input: 0.50, output: 1.50 },
'claude-3-5-sonnet-20241022' => { input: 3.00, output: 15.00 },
'claude-3-5-haiku-20241022' => { input: 0.25, output: 1.25 },
'llama-3.1-70b-versatile' => { input: 0.59, output: 0.79 }
}
end
def select_optimal_model(estimated_tokens, budget = nil)
# Default to balanced option
return 'gpt-4o-mini' unless budget
viable_models = @token_costs.select do |model, costs|
estimated_cost = calculate_cost(estimated_tokens, model)
estimated_cost <= budget
end
# Select highest quality model within budget
viable_models.max_by do |model, costs|
model_quality_score(model)
end&.first || 'gpt-4o-mini'
end
def model_quality_score(model)
quality_scores = {
'gpt-4o' => 100,
'claude-3-5-sonnet-20241022' => 95,
'gpt-4-turbo' => 90,
'claude-3-5-haiku-20241022' => 80,
'gpt-4o-mini' => 75,
'llama-3.1-70b-versatile' => 70,
'gpt-3.5-turbo' => 60
}
quality_scores[model] || 50
end
def optimize_instructions(instructions, model)
# Simplify instructions for smaller models
case model
when 'gpt-4o-mini', 'gpt-3.5-turbo'
# Keep instructions concise for smaller models
instructions.split('.').first(3).join('.') + '.'
else
instructions
end
end
def estimate_completion_tokens(message, model)
# Estimate based on message complexity and model
base_tokens = message.length / 4 # Start with message length
# Adjust based on model capabilities
multiplier = case model
when 'gpt-4o', 'claude-3-5-sonnet-20241022'
1.5 # More detailed responses
when 'gpt-4o-mini', 'claude-3-5-haiku-20241022'
1.0 # Concise responses
else
1.2 # Default
end
# Adjust based on query type
if message.include?('explain') || message.include?('describe')
multiplier *= 1.5
elsif message.include?('list') || message.include?('summary')
multiplier *= 0.8
end
(base_tokens * multiplier).round
end
end
end
end
6. Memory Management and Resource Optimization
6.1. Memory-Efficient Agent Management
# lib/raaf/performance/agent_pool.rb
module RAAF
module Performance
class AgentPool
include Singleton
def initialize
@pools = {}
@mutex = Mutex.new
@cleanup_thread = start_cleanup_thread
end
def get_agent(agent_type, &block)
pool = get_or_create_pool(agent_type)
pool.with do |agent|
# Reset agent state before use
reset_agent_state(agent)
yield agent
end
end
def warm_up(agent_type, pool_size = 5)
get_or_create_pool(agent_type, pool_size)
end
def shutdown
@cleanup_thread&.kill
@pools.each do |_, pool|
pool.shutdown(&:cleanup) if pool.respond_to?(:shutdown)
end
end
private
def get_or_create_pool(agent_type, size = 3)
@mutex.synchronize do
@pools[agent_type] ||= ConnectionPool.new(size: size, timeout: 30) do
create_agent(agent_type)
end
end
end
def create_agent(agent_type)
config = load_agent_config(agent_type)
agent = RAAF::Agent.new(
name: config[:name],
instructions: config[:instructions],
model: config[:model]
)
# Add tools if specified
config[:tools]&.each do |tool_config|
agent.add_tool(load_tool(tool_config))
end
agent
end
def reset_agent_state(agent)
# Clear any stateful information
agent.instance_variable_set(:@conversation_history, nil) if agent.instance_variable_defined?(:@conversation_history)
agent.instance_variable_set(:@context_variables, {}) if agent.instance_variable_defined?(:@context_variables)
end
def start_cleanup_thread
Thread.new do
loop do
sleep(300) # Run every 5 minutes
cleanup_unused_pools
end
end
end
def cleanup_unused_pools
@mutex.synchronize do
@pools.each do |agent_type, pool|
# Check if pool has been used recently
last_used = pool.instance_variable_get(:@last_used) || Time.current
if last_used < 30.minutes.ago
pool.shutdown if pool.respond_to?(:shutdown)
@pools.delete(agent_type)
Rails.logger.info "Cleaned up unused agent pool: #{agent_type}"
end
end
end
end
def load_agent_config(agent_type)
# Load from configuration file or database
configs = {
customer_support: {
name: "CustomerSupport",
instructions: "You are a helpful customer support agent.",
model: "gpt-4o-mini",
tools: [:knowledge_base_search]
},
data_analyst: {
name: "DataAnalyst",
instructions: "You analyze data and provide insights.",
model: "gpt-4o",
tools: [:sql_query, :chart_generation]
}
}
configs[agent_type] || raise("Unknown agent type: #{agent_type}")
end
def load_tool(tool_config)
# Tool loading logic
case tool_config
when :knowledge_base_search
method(:knowledge_base_search_tool)
when :sql_query
method(:sql_query_tool)
when :chart_generation
method(:chart_generation_tool)
else
raise("Unknown tool: #{tool_config}")
end
end
end
end
end
6.2. Garbage Collection Optimization
# lib/raaf/performance/gc_optimizer.rb
module RAAF
module Performance
class GCOptimizer
include Singleton
def initialize
@gc_stats = {}
@optimization_enabled = true
setup_gc_monitoring
end
def optimize_for_workload(workload_type)
case workload_type
when :high_throughput
optimize_for_throughput
when :low_latency
optimize_for_latency
when :memory_constrained
optimize_for_memory
else
use_balanced_settings
end
end
def enable_gc_compaction
return unless GC.respond_to?(:compact)
# Schedule periodic compaction
Thread.new do
loop do
sleep(600) # Every 10 minutes
perform_compaction if should_compact?
end
end
end
def monitor_gc_performance
before_stats = GC.stat
yield
after_stats = GC.stat
gc_time = after_stats[:time] - before_stats[:time]
gc_count = after_stats[:count] - before_stats[:count]
record_gc_metrics(gc_time, gc_count)
end
private
def setup_gc_monitoring
return unless @optimization_enabled
# Monitor GC events
TracePoint.new(:gc_enter, :gc_exit) do |tp|
case tp.event
when :gc_enter
@gc_start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
when :gc_exit
if @gc_start_time
gc_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @gc_start_time
record_gc_event(gc_duration)
end
end
end.enable
end
def optimize_for_throughput
# Reduce GC frequency, allow larger heap
GC.tune(
:RUBY_GC_HEAP_GROWTH_FACTOR => 1.8,
:RUBY_GC_HEAP_INIT_SLOTS => 10000,
:RUBY_GC_HEAP_FREE_SLOTS => 4000,
:RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR => 2.0
)
end
def optimize_for_latency
# More frequent but shorter GC cycles
GC.tune(
:RUBY_GC_HEAP_GROWTH_FACTOR => 1.1,
:RUBY_GC_HEAP_INIT_SLOTS => 5000,
:RUBY_GC_HEAP_FREE_SLOTS => 1000,
:RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR => 0.9
)
end
def optimize_for_memory
# Aggressive GC to minimize memory usage
GC.tune(
:RUBY_GC_HEAP_GROWTH_FACTOR => 1.05,
:RUBY_GC_HEAP_INIT_SLOTS => 1000,
:RUBY_GC_HEAP_FREE_SLOTS => 500,
:RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR => 0.5
)
end
def use_balanced_settings
# Ruby defaults with slight optimizations
GC.tune(
:RUBY_GC_HEAP_GROWTH_FACTOR => 1.2,
:RUBY_GC_HEAP_INIT_SLOTS => 5000,
:RUBY_GC_HEAP_FREE_SLOTS => 2000,
:RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR => 1.2
)
end
def should_compact?
return false unless GC.respond_to?(:compact)
stat = GC.stat
heap_slots = stat[:heap_live_slots] + stat[:heap_free_slots]
fragmentation_ratio = stat[:heap_free_slots].to_f / heap_slots
# Compact if fragmentation > 30%
fragmentation_ratio > 0.3
end
def perform_compaction
Rails.logger.info "Performing GC compaction..."
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
GC.compact
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
Rails.logger.info "GC compaction completed in #{duration.round(3)}s"
PerformanceTracker.instance.record_metric('gc_compaction_duration', duration)
end
def record_gc_event(duration)
@gc_stats[:total_time] = (@gc_stats[:total_time] || 0) + duration
@gc_stats[:count] = (@gc_stats[:count] || 0) + 1
@gc_stats[:avg_duration] = @gc_stats[:total_time] / @gc_stats[:count]
PerformanceTracker.instance.record_metric('gc_duration', duration)
end
def record_gc_metrics(gc_time, gc_count)
PerformanceTracker.instance.record_metric('gc_time_total', gc_time) if gc_time > 0
PerformanceTracker.instance.record_metric('gc_count', gc_count) if gc_count > 0
end
end
end
end
7. Performance Monitoring and Profiling
7.1. Performance Profiler
# lib/raaf/performance/profiler.rb
module RAAF
module Performance
class Profiler
include Singleton
def initialize
@profiles = {}
@active_profiles = {}
end
def profile(operation_name, &block)
profile_id = start_profiling(operation_name)
begin
result = yield
end_profiling(profile_id, success: true)
result
rescue => e
end_profiling(profile_id, success: false, error: e)
raise
end
end
def memory_profile(operation_name, &block)
require 'memory_profiler'
report = MemoryProfiler.report do
yield
end
analysis = analyze_memory_report(report)
store_memory_profile(operation_name, analysis)
analysis
end
def cpu_profile(operation_name, duration: 30, &block)
require 'ruby-prof'
RubyProf.start
if block_given?
result = yield
else
sleep(duration)
end
result_data = RubyProf.stop
analysis = analyze_cpu_profile(result_data)
store_cpu_profile(operation_name, analysis)
block_given? ? result : analysis
end
def benchmark(operation_name, iterations: 100, &block)
require 'benchmark'
times = []
iterations.times do
time = Benchmark.measure(&block)
times << time.real
end
analysis = {
operation: operation_name,
iterations: iterations,
avg_time: times.sum / times.size,
min_time: times.min,
max_time: times.max,
std_deviation: calculate_std_deviation(times),
percentiles: calculate_percentiles(times)
}
store_benchmark(operation_name, analysis)
analysis
end
def get_performance_summary(operation_name = nil)
if operation_name
@profiles[operation_name] || {}
else
@profiles
end
end
private
def start_profiling(operation_name)
profile_id = SecureRandom.uuid
@active_profiles[profile_id] = {
operation: operation_name,
start_time: Process.clock_gettime(Process::CLOCK_MONOTONIC),
start_memory: get_memory_usage,
start_gc_stat: GC.stat
}
profile_id
end
def end_profiling(profile_id, success:, error: nil)
profile = @active_profiles.delete(profile_id)
return unless profile
end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
end_memory = get_memory_usage
end_gc_stat = GC.stat
duration = end_time - profile[:start_time]
memory_delta = end_memory - profile[:start_memory]
gc_delta = calculate_gc_delta(profile[:start_gc_stat], end_gc_stat)
profile_data = {
operation: profile[:operation],
duration: duration,
memory_delta: memory_delta,
gc_time: gc_delta[:time],
gc_count: gc_delta[:count],
success: success,
error: error&.class&.name,
timestamp: Time.current
}
store_profile(profile[:operation], profile_data)
# Send to monitoring
PerformanceTracker.instance.record_metric(
'operation_duration',
duration,
operation: profile[:operation],
success: success
)
end
def get_memory_usage
# Get current memory usage in MB
`ps -o rss= -p #{Process.pid}`.to_i / 1024.0
rescue
0
end
def calculate_gc_delta(start_stat, end_stat)
{
time: end_stat[:time] - start_stat[:time],
count: end_stat[:count] - start_stat[:count]
}
end
def analyze_memory_report(report)
{
total_allocated: report.total_allocated,
total_retained: report.total_retained,
allocated_objects: report.allocated_objects_by_class.first(10),
retained_objects: report.retained_objects_by_class.first(10),
allocated_memory_by_file: report.allocated_memory_by_file.first(10),
allocated_memory_by_location: report.allocated_memory_by_location.first(10)
}
end
def analyze_cpu_profile(profile_data)
# Analyze CPU profiling results
{
total_time: profile_data.total_time,
methods: profile_data.methods.sort_by(&:total_time).reverse.first(20).map do |method|
{
name: method.full_name,
total_time: method.total_time,
self_time: method.self_time,
calls: method.called
}
end
}
end
def calculate_std_deviation(values)
mean = values.sum / values.size.to_f
variance = values.sum { |v| (v - mean) ** 2 } / values.size.to_f
Math.sqrt(variance)
end
def calculate_percentiles(values)
sorted = values.sort
{
p50: percentile(sorted, 50),
p90: percentile(sorted, 90),
p95: percentile(sorted, 95),
p99: percentile(sorted, 99)
}
end
def percentile(sorted_values, percent)
index = (percent / 100.0 * sorted_values.length).ceil - 1
sorted_values[index]
end
def store_profile(operation_name, profile_data)
@profiles[operation_name] ||= []
@profiles[operation_name] << profile_data
# Keep only recent profiles (last 1000)
@profiles[operation_name] = @profiles[operation_name].last(1000)
end
def store_memory_profile(operation_name, analysis)
@profiles["#{operation_name}_memory"] = analysis
end
def store_cpu_profile(operation_name, analysis)
@profiles["#{operation_name}_cpu"] = analysis
end
def store_benchmark(operation_name, analysis)
@profiles["#{operation_name}_benchmark"] = analysis
end
end
end
end
8. Production Performance Configuration
8.1. Environment-specific Optimizations
EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance.
Error: NoMethodError: undefined method 'env' for module Rails /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-evmznr.rb:445:in '<main>'
# config/initializers/raaf_performance.rb
if Rails.env.production?
# Enable all performance optimizations
RAAF::Performance::GCOptimizer.instance.optimize_for_workload(:high_throughput)
RAAF::Performance::GCOptimizer.instance.enable_gc_compaction
# Warm up agent pools
RAAF::Performance::AgentPool.instance.warm_up(:customer_support, 10)
RAAF::Performance::AgentPool.instance.warm_up(:data_analyst, 5)
# Configure provider routing
router = RAAF::Performance::ProviderRouter.new
router.register_provider(:openai_primary,
RAAF::Models::ResponsesProvider.new,
weight: 3, priority: 1)
router.register_provider(:anthropic_backup,
RAAF::Models::AnthropicProvider.new,
weight: 1, priority: 2)
# Enable response caching
RAAF.configure do |config|
config.response_cache_enabled = true
config.cache_ttl = 1.hour
config.cache_size = 500.megabytes
end
# Token optimization
RAAF.configure do |config|
config.token_optimizer_enabled = true
config.cost_budget_per_request = 0.01 # $0.01 per request
config.prefer_fast_models_for_simple_queries = true
end
end
if Rails.env.development?
# Enable profiling in development
RAAF.configure do |config|
config.profiling_enabled = true
config.detailed_logging = true
end
end
8.2. Monitoring Dashboard Integration
# app/controllers/admin/performance_controller.rb
class Admin::PerformanceController < ApplicationController
before_action :authenticate_admin!
def index
@performance_summary = build_performance_summary
@recent_profiles = get_recent_profiles
@cache_stats = get_cache_statistics
@provider_health = get_provider_health
end
def detailed_report
operation = params[:operation]
time_range = parse_time_range(params[:time_range])
@report = {
operation: operation,
time_range: time_range,
performance_stats: PerformanceTracker.instance.get_stats(operation, time_range),
recent_profiles: Profiler.instance.get_performance_summary(operation),
cost_analysis: TokenOptimizer.instance.get_usage_report(time_range)
}
render json: @report
end
private
def build_performance_summary
{
avg_response_time: PerformanceTracker.instance.get_stats('agent_response_time')[:avg],
cache_hit_rate: ResponseCache.instance.hit_rate,
total_requests_today: count_requests_today,
error_rate: calculate_error_rate,
cost_today: calculate_cost_today
}
end
def get_recent_profiles
Profiler.instance.get_performance_summary
.values
.flatten
.select { |p| p[:timestamp] > 1.hour.ago }
.sort_by { |p| p[:timestamp] }
.reverse
.first(20)
end
def get_cache_statistics
{
hit_rate: ResponseCache.instance.hit_rate,
size: ResponseCache.instance.cache_size,
evictions: ResponseCache.instance.eviction_count
}
end
def get_provider_health
# Return health status of all configured providers
{
openai: check_provider_health(:openai),
anthropic: check_provider_health(:anthropic),
groq: check_provider_health(:groq)
}
end
def check_provider_health(provider_name)
# Implementation for checking provider health
{
status: 'healthy',
avg_latency: rand(100..500),
success_rate: 98.5,
last_check: Time.current
}
end
end
9. Best Practices Summary
9.1. Performance Optimization Checklist
Connection Management
- ✅ Use connection pooling for AI providers
- ✅ Enable HTTP keep-alive
- ✅ Implement request compression
- ✅ Configure appropriate timeouts
Caching Strategy
- ✅ Cache deterministic agent responses
- ✅ Implement intelligent cache invalidation
- ✅ Use appropriate TTL values
- ✅ Monitor cache hit rates
Resource Management
- ✅ Pool and reuse agent instances
- ✅ Optimize garbage collection settings
- ✅ Monitor memory usage
- ✅ Implement resource cleanup
Cost Optimization
- ✅ Choose appropriate models for tasks
- ✅ Optimize prompt lengths
- ✅ Track token usage and costs
- ✅ Implement budget controls
Provider Management
- ✅ Implement multi-provider routing
- ✅ Use circuit breakers for reliability
- ✅ Monitor provider health
- ✅ Implement graceful fallbacks
Monitoring and Profiling
- ✅ Track key performance metrics
- ✅ Profile critical operations
- ✅ Set up alerting for performance issues
- ✅ Regular performance reviews
10. Next Steps
For more advanced topics:
- RAAF Tracing Guide - Advanced monitoring and observability
- Configuration Reference - Production configuration strategies
- Cost Management Guide - Advanced cost optimization
- Troubleshooting Guide - Performance troubleshooting