More at guides.raaf-ai.dev:

Multi-Agent Workflows

This guide covers building sophisticated multi-agent workflows with Ruby AI Agents Factory (RAAF). Multi-agent systems enable complex tasks to be broken down into specialized agents that work together.

After reading this guide, you will know:

  • How to design effective multi-agent workflows
  • Agent handoff patterns and best practices
  • Context sharing and data flow between agents
  • Orchestration strategies for complex workflows
  • Error handling and recovery in multi-agent systems
  • Performance optimization for agent coordination

1. Introduction

1.1. Multi-Agent Architecture Philosophy

Multi-agent workflows represent a fundamental architectural approach that decomposes complex AI tasks into specialized, coordinated components. This approach mirrors proven software engineering principles where system complexity is managed through modular design and separation of concerns.

1.2. Decomposition Strategy

Rather than building monolithic agents that attempt to handle all tasks, multi-agent systems create specialized agents that excel at specific responsibilities and collaborate to solve complex problems. This decomposition enables better optimization, maintenance, and scalability.

1.3. Problems with Monolithic Agent Design

Monolithic agents that attempt to handle multiple diverse tasks often produce suboptimal results due to conflicting objectives and competing instruction sets.

Common issues with generalist agents include:

  • Context confusion when switching between different task types
  • Inability to optimize for specific domain requirements
  • Difficulty maintaining consistent behavior across varied use cases
  • Increased complexity in prompt engineering and maintenance

Specialized agents address these issues by focusing on well-defined responsibilities with clear boundaries. This approach mirrors software engineering principles where single-responsibility components are easier to build, test, and maintain.

1.4. Specialization Benefits

Multi-agent systems leverage specialization principles to achieve superior results through focused expertise rather than generalist approaches.

Domain Expertise: Specialized agents develop deep competency in their specific domains, producing higher quality results than generalist agents attempting to handle multiple domains.

Optimized Performance: Each agent can be optimized for its specific tasks, using appropriate models, tools, and processing strategies without compromise.

Focused Development: Agent development can focus on perfecting specific capabilities rather than balancing competing requirements across multiple domains.

Quality Consistency: Specialized agents maintain consistent performance within their domain expertise, avoiding the quality variations that occur when generalist agents switch between different task types.

1.5. System-Level Benefits

Specialization at the agent level creates emergent system-level benefits: improved accuracy, enhanced user satisfaction, and more predictable behavior patterns across the entire system.

1.6. Coordination Architecture

Multi-agent systems require sophisticated coordination mechanisms that enable specialized agents to work together effectively while maintaining their individual expertise.

Role Definition: Each agent operates within well-defined boundaries that specify its responsibilities, capabilities, and interaction patterns. This definition prevents overlap and ensures efficient task distribution.

Workflow Orchestration: Coordination mechanisms manage the flow of work between agents, ensuring that tasks are completed in the correct sequence and that dependencies are properly managed.

Quality Checkpoints: Agent handoffs create natural quality control points where work can be reviewed, validated, and refined before proceeding to the next stage.

1.7. Content Creation Example

Research Agent Requirements: Thorough, analytical processing with systematic information gathering and source verification capabilities. Optimized for accuracy and comprehensiveness.

Writing Agent Requirements: Creative, engaging content generation with narrative construction and stylistic consistency. Optimized for readability and user engagement.

Editing Agent Requirements: Critical analysis, error detection, and clarity improvement. Optimized for accuracy and communication effectiveness.

1.8. Architectural Advantages

This specialized architecture enables each agent to use different models, tools, and processing strategies optimized for their specific requirements, resulting in superior overall system performance compared to monolithic approaches.

1.9. Architectural Benefits

Focused Specialization: Each agent operates within a well-defined domain of expertise, enabling deep optimization for specific tasks and requirements. This specialization produces higher quality results than generalist approaches.

Modular Architecture: Independent agent development and deployment enables isolated improvements, testing, and maintenance. Changes to one agent don't affect others, reducing system-wide risk and enabling continuous improvement.

Distributed Scalability: Work distribution across multiple agents enables both parallel processing and independent scaling based on demand patterns. Different agents can use different resources and scaling strategies.

Resilient Design: Fault isolation ensures that failures in one agent don't cascade through the entire system. Graceful degradation and recovery mechanisms maintain system operation even when individual agents encounter issues.

Flexible Composition: Agents can be combined and recombined to create different workflows without architectural changes. This flexibility enables rapid adaptation to changing requirements.

1.10. System-Level Advantages

These individual benefits combine to create system-level advantages: improved reliability, better performance, easier maintenance, and greater flexibility in responding to changing requirements.

  • Flexibility - Different agents can use different models or providers You can use fast, cheap models for simple tasks and reserve powerful models for complex reasoning, optimizing both cost and performance.

1.11. Common Patterns

  • Sequential Workflows - Agent A → Agent B → Agent C The most common pattern, where each agent builds on the previous agent's work. Good for linear processes like research → writing → editing.

  • Parallel Processing - Multiple agents work simultaneously Agents work on different aspects of a problem at the same time, then results are combined. Useful for tasks that can be divided into independent subtasks.

  • Hierarchical Delegation - Supervisor agents coordinate worker agents A coordinator agent breaks down complex tasks and delegates work to specialized agents. Good for managing complex workflows with multiple steps.

  • Peer-to-Peer Collaboration - Agents communicate directly Agents can call each other directly without a central coordinator. Useful for collaborative tasks where agents need to negotiate or share information.

  • Event-Driven Systems - Agents react to events and triggers Agents respond to external events or internal state changes. Good for reactive systems that need to respond to changing conditions.

2. Basic Multi-Agent Setup

2.1. Multi-Agent Architecture Principles

Effective multi-agent systems require clear role definitions and well-defined interaction patterns.

Common coordination problems:

  • Agents with overlapping responsibilities create redundant work
  • Vague instructions lead to unpredictable handoff behaviors
  • Unclear workflow boundaries cause agents to interfere with each other
  • Poor context transfer results in information loss between agents

Solution approach: Define explicit roles, responsibilities, and handoff criteria for each agent in the system.

2.2. How to Build Agents That Actually Work Together

The secret is in the instructions. Not just what they do, but how they interact:

  1. Clear Role Definition: Each agent knows exactly what they're responsible for
  2. Handoff Triggers: Explicit conditions for passing work to the next agent
  3. Context Preservation: What information needs to flow between agents
  4. Quality Gates: When to proceed vs. when to escalate

2.3. Creating Specialized Agents

EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance. <internal:/Users/hajee/.rvm/rubies/ruby-3.4.5/lib/ruby/3.4.0/rubygems/core_ext/kernel_require.rb>:136:in 'Kernel#require': cannot load such file -- raaf (LoadError) from <internal:/Users/hajee/.rvm/rubies/ruby-3.4.5/lib/ruby/3.4.0/rubygems/core_ext/kernel_require.rb>:136:in 'Kernel#require' from /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-oq2ete.rb:444:in '<main>'

require 'raaf'

# Research agent - gathers and analyzes information
research_agent = RAAF::Agent.new(
  name: "Researcher",
  instructions: <<~INSTRUCTIONS,
    You are a research specialist who gathers comprehensive information on topics.
    Your role is to:

    - Search for relevant information
    - Analyze and synthesize data
    - Provide well-structured research summaries

    When your research is complete, hand off to the Writer for content creation.
  INSTRUCTIONS
  model: "gpt-4o"
)

# Writing agent - creates content based on research
writer_agent = RAAF::Agent.new(
  name: "Writer",
  instructions: <<~INSTRUCTIONS,
    You are a content writer who creates engaging, well-structured content.
    Your role is to:

    - Transform research into compelling narratives
    - Ensure clarity and readability
    - Maintain consistent tone and style

    When content is complete, hand off to the Editor for review.
  INSTRUCTIONS
  model: "gpt-4o"
)

# Editor agent - reviews and refines content
editor_agent = RAAF::Agent.new(
  name: "Editor",
  instructions: <<~INSTRUCTIONS,
    You are an editor who reviews and refines content for publication.
    Your role is to:

    - Check grammar, style, and clarity
    - Ensure factual accuracy
    - Optimize for target audience

    Provide final polished content ready for publication.
  INSTRUCTIONS
  model: "gpt-4o"
)

2.4. Using Prompts with Multi-Agent Systems

For complex multi-agent workflows, managing prompts becomes crucial. RAAF's prompt system helps maintain consistency and reusability across agents:

EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance. Error: NameError: undefined local variable or method 'editor_agent' for main /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-pf64k9.rb:515:in '<main>'

# Define reusable prompt templates
class ResearchAgentPrompt
  def initialize(domain:, research_depth:, sources: ["web", "academic"])
    @domain = domain
    @research_depth = research_depth
    @sources = sources
  end

  def system
    <<~SYSTEM
      You are a research specialist in #{@domain}.
      Research depth: #{@research_depth}
      Available sources: #{@sources.join(", ")}

      When research is complete, hand off to the Writer with:
      - Research summary
      - Key findings
      - Source citations
    SYSTEM
  end
end

class WriterAgentPrompt
  def initialize(content_type:, target_audience:, tone: "professional")
    @content_type = content_type
    @target_audience = target_audience
    @tone = tone
  end

  def system
    <<~SYSTEM
      You are a content writer creating #{@content_type} for #{@target_audience}.
      Writing tone: #{@tone}

      Transform research into compelling content.
      When complete, hand off to Editor for review.
    SYSTEM
  end
end

# Create prompt instances
research_prompt = ResearchAgentPrompt.new(
  domain: "quantum physics",
  research_depth: "comprehensive"
)

writer_prompt = WriterAgentPrompt.new(
  content_type: "blog post",
  target_audience: "tech professionals",
  tone: "engaging yet technical"
)

# Create agents with prompt templates
research_agent = RAAF::DSL::AgentBuilder.build do
  name "Researcher"
  prompt research_prompt
  model "gpt-4o"

  use_web_search
  use_file_search
end

writer_agent = RAAF::DSL::AgentBuilder.build do
  name "Writer"
  prompt writer_prompt
  model "gpt-4o"
end

# Run multi-agent workflow
runner = RAAF::Runner.new(
  agent: research_agent,
  agents: [research_agent, writer_agent, editor_agent]
)

result = runner.run("Research and write about quantum computing")

Using prompt templates provides several benefits in multi-agent systems: - Consistency: Ensure all agents in a workflow use consistent terminology - Reusability: Share prompt patterns across similar agent types - Testability: Test prompt logic independently of agent execution - Version Control: Track changes to agent behavior over time

For more details on prompt management, see the Prompting Guide.

2.5. Setting Up Tool-Based Handoffs

RAAF uses tool-based handoffs where agents must explicitly call transfer functions:

EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance. Error: NameError: undefined local variable or method 'research_agent' for main /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-a2brng.rb:445:in '<main>'

# Define the workflow: Research → Write → Edit
research_agent.add_handoff(writer_agent)    # Creates transfer_to_writer_agent tool
writer_agent.add_handoff(editor_agent)      # Creates transfer_to_editor_agent tool

# Create runner with all agents
runner = RAAF::Runner.new(
  agent: research_agent,  # Starting agent
  agents: [research_agent, writer_agent, editor_agent]
)

# Execute the workflow
result = runner.run("Create an article about sustainable energy technologies")

# The system works through explicit tool calls:
# 1. Research agent completes research
# 2. Research agent calls transfer_to_writer_agent tool
# 3. Writer agent creates article
# 4. Writer agent calls transfer_to_editor_agent tool
# 5. Editor agent performs final review

3. Advanced Handoff Patterns

3.1. Effective Agent Handoff Patterns

Effective handoff patterns require clear role separation and defined responsibilities for each agent in the workflow.

Role-based specialization principles:

  • Triage agents: Assess and route requests to appropriate specialists
  • Specialist agents: Handle domain-specific tasks with focused expertise
  • Coordinator agents: Manage complex workflows and escalation paths

Common handoff problems: Agents attempting to handle tasks outside their expertise create delays and reduce service quality. Specialized agents with clear boundaries provide faster, more accurate responses.

Effective system design:

  1. Triage Agent: Assesses and routes (optimized for speed)
  2. Specialist Agents: Handle specific issues (optimized for accuracy)
  3. Escalation Paths: Clear criteria for human intervention

3.2. Handoff Patterns

1. Conditional Handoffs: "If X, then hand to Agent Y" 2. Parallel Processing: "Everyone work on your part simultaneously" 3. Hierarchical Delegation: "Manager assigns tasks to team"

Let's see how each one saves the day...

3.3. Tool-Based Handoffs

RAAF uses explicit tool-based handoffs. Agents must call specific transfer tools:

# Create customer service agent with clear handoff instructions
customer_service = RAAF::Agent.new(
  name: "CustomerService",
  instructions: <<~INSTRUCTIONS,
    Handle customer inquiries professionally. 

    For technical issues, bugs, or system problems: call transfer_to_technical_support
    For billing questions, payments, or account issues: call transfer_to_billing_agent

    You MUST use the transfer tools - simply mentioning transfers won't work.
  INSTRUCTIONS
  model: "gpt-4o"
)

# Create specialized agents
technical_support = RAAF::Agent.new(
  name: "TechnicalSupport", 
  instructions: "Handle technical issues and troubleshooting.",
  model: "gpt-4o"
)

billing_agent = RAAF::Agent.new(
  name: "BillingAgent",
  instructions: "Handle billing and payment inquiries.",
  model: "gpt-4o"
)

# Set up handoff capabilities (creates transfer tools)
customer_service.add_handoff(technical_support)  # Creates transfer_to_technical_support
customer_service.add_handoff(billing_agent)      # Creates transfer_to_billing_agent

runner = RAAF::Runner.new(
  agent: customer_service,
  agents: [customer_service, technical_support, billing_agent]
)

# The system will route to appropriate agent based on inquiry type
result = runner.run("I'm having trouble with the API integration")
# → Routes to TechnicalSupport

result = runner.run("I need help with my monthly subscription")
# → Routes to BillingAgent

3.4. Parallel Agent Execution

EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance. Error: NameError: uninitialized constant DataAnalysisOrchestrator::Concurrent /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-j42t9w.rb:503:in 'DataAnalysisOrchestrator#run_agent_async' /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-j42t9w.rb:475:in 'DataAnalysisOrchestrator#analyze_dataset' /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-j42t9w.rb:512:in '<main>'

class DataAnalysisOrchestrator
  def initialize
    # Create specialized analysis agents
    @statistical_agent = RAAF::Agent.new(
      name: "StatisticalAnalyst",
      instructions: "Perform statistical analysis and hypothesis testing",
      model: "gpt-4o"
    )

    @visualization_agent = RAAF::Agent.new(
      name: "VisualizationSpecialist", 
      instructions: "Create charts, graphs, and visual representations",
      model: "gpt-4o"
    )

    @insights_agent = RAAF::Agent.new(
      name: "InsightsAnalyst",
      instructions: "Extract business insights and recommendations",
      model: "gpt-4o"
    )

    @report_agent = RAAF::Agent.new(
      name: "ReportWriter",
      instructions: "Compile analysis into comprehensive reports",
      model: "gpt-4o"
    )
  end

  def analyze_dataset(dataset_path)
    # Phase 1: Parallel analysis
    analysis_futures = [
      run_agent_async(@statistical_agent, "Analyze statistical patterns in #{dataset_path}"),
      run_agent_async(@visualization_agent, "Create visualizations for #{dataset_path}"),
      run_agent_async(@insights_agent, "Extract business insights from #{dataset_path}")
    ]

    # Wait for all parallel analyses to complete
    statistical_result = analysis_futures[0].value
    visualization_result = analysis_futures[1].value
    insights_result = analysis_futures[2].value

    # Phase 2: Compile comprehensive report
    combined_context = {
      statistical_analysis: statistical_result.messages.last[:content],
      visualizations: visualization_result.messages.last[:content],
      business_insights: insights_result.messages.last[:content]
    }

    report_runner = RAAF::Runner.new(
      agent: @report_agent,
      context_variables: combined_context
    )

    report_runner.run("Create a comprehensive data analysis report")
  end

  private

  def run_agent_async(agent, message)
    Concurrent::Future.execute do
      runner = RAAF::Runner.new(agent: agent)
      runner.run(message)
    end
  end
end

# Usage
orchestrator = DataAnalysisOrchestrator.new
result = orchestrator.analyze_dataset("sales_data_2024.csv")

3.5. Hierarchical Agent Systems

class ProjectManagementSystem
  def initialize
    # Supervisor agent coordinates the project
    @project_manager = RAAF::Agent.new(
      name: "ProjectManager",
      instructions: <<~INSTRUCTIONS,
        You coordinate software development projects.
        Break down tasks and delegate to appropriate team members:

        - Requirements analysis → BusinessAnalyst
        - Architecture design → TechLead  
        - Code implementation → Developer
        - Quality assurance → QAEngineer
        - Documentation → TechnicalWriter
      INSTRUCTIONS
      model: "gpt-4o"
    )

    # Specialized worker agents
    @business_analyst = create_business_analyst
    @tech_lead = create_tech_lead
    @developer = create_developer
    @qa_engineer = create_qa_engineer
    @technical_writer = create_technical_writer

    setup_delegation_hierarchy
  end

  def manage_project(requirements)
    # Project manager coordinates the entire workflow
    runner = RAAF::Runner.new(
      agent: @project_manager,
      agents: [
        @project_manager, @business_analyst, @tech_lead,
        @developer, @qa_engineer, @technical_writer
      ],
      context_variables: {
        project_requirements: requirements,
        project_phase: 'planning'
      }
    )

    runner.run("Please manage this software development project: #{requirements}")
  end

  private

  def create_business_analyst
    RAAF::Agent.new(
      name: "BusinessAnalyst",
      instructions: <<~INSTRUCTIONS,
        Analyze business requirements and create detailed specifications.
        When analysis is complete, hand back to ProjectManager.
      INSTRUCTIONS
      model: "gpt-4o"
    )
  end

  def create_tech_lead
    RAAF::Agent.new(
      name: "TechLead",
      instructions: <<~INSTRUCTIONS,
        Design system architecture and technical approach.
        When design is complete, hand back to ProjectManager.
      INSTRUCTIONS
      model: "gpt-4o"
    )
  end

  def setup_delegation_hierarchy
    # Each worker reports back to project manager
    [@business_analyst, @tech_lead, @developer, @qa_engineer, @technical_writer].each do |agent|
      agent.add_handoff(@project_manager)
    end

    # Project manager can delegate to any worker
    @project_manager.add_handoff(@business_analyst)
    @project_manager.add_handoff(@tech_lead)
    @project_manager.add_handoff(@developer)
    @project_manager.add_handoff(@qa_engineer)
    @project_manager.add_handoff(@technical_writer)
  end
end

4. Context Management in Multi-Agent Systems

4.1. Context Preservation Strategies

Picture this: A Fortune 500 client using our AI system for a complex M&A deal. Seven specialized agents working together—legal, financial, risk assessment, due diligence, compliance, negotiation, and documentation.

Everything was perfect until Agent #4 (due diligence) finished its analysis. When it handed off to Agent #5 (compliance), a tiny bug wiped the context clean.

Agent #5: "Hello! What company would you like me to analyze?"

The client had spent 3 hours providing detailed information. Gone. They walked away from our product and the $2M contract.

4.2. Why Context Is the Lifeblood of Multi-Agent Systems

Think of context like a patient's medical chart in a hospital:

  • Without it: Every doctor starts from scratch, asks the same questions, orders the same tests
  • With it: Each specialist builds on previous findings, treatment progresses smoothly

In multi-agent systems, context carries:

  1. The Mission: What are we trying to accomplish?
  2. The Progress: What's been done so far?
  3. The Decisions: What choices have been made?
  4. The Constraints: What limitations exist?
  5. The Knowledge: What have we learned?

4.3. The Three Patterns of Context Flow

1. Shared Memory Pattern: Like a Google Doc everyone can edit 2. Message Passing Pattern: Like email chains with attachments 3. Blackboard Pattern: Like a war room whiteboard

Let's explore each...

4.4. Shared Context Variables

class ECommerceWorkflow
  def initialize
    @order_processor = RAAF::Agent.new(
      name: "OrderProcessor",
      instructions: "Process e-commerce orders and validate information",
      model: "gpt-4o"
    )

    @inventory_manager = RAAF::Agent.new(
      name: "InventoryManager", 
      instructions: "Check inventory and reserve items",
      model: "gpt-4o"
    )

    @payment_processor = RAAF::Agent.new(
      name: "PaymentProcessor",
      instructions: "Process payments and handle transactions",
      model: "gpt-4o"
    )

    @fulfillment_agent = RAAF::Agent.new(
      name: "FulfillmentAgent",
      instructions: "Coordinate shipping and delivery",
      model: "gpt-4o"
    )

    setup_workflow
  end

  def process_order(order_data)
    # Shared context that flows through the entire workflow
    shared_context = {
      order_id: order_data[:id],
      customer_id: order_data[:customer_id],
      items: order_data[:items],
      shipping_address: order_data[:shipping_address],
      payment_method: order_data[:payment_method],
      order_status: 'processing'
    }

    runner = RAAF::Runner.new(
      agent: @order_processor,
      agents: [@order_processor, @inventory_manager, @payment_processor, @fulfillment_agent],
      context_variables: shared_context
    )

    result = runner.run("Process this e-commerce order")

    # Context is updated throughout the workflow
    final_context = runner.context_variables
    {
      success: result.success?,
      order_id: final_context[:order_id],
      final_status: final_context[:order_status],
      tracking_number: final_context[:tracking_number]
    }
  end

  private

  def setup_workflow
    # Define the order processing pipeline
    @order_processor.add_handoff(@inventory_manager)
    @inventory_manager.add_handoff(@payment_processor)  
    @payment_processor.add_handoff(@fulfillment_agent)

    # Add tools that update shared context
    add_context_updating_tools
  end

  def add_context_updating_tools
    # Tool to update order status (available to all agents)
    update_order_status = lambda do |status:, notes: nil|
      # This tool automatically updates the shared context
      runner.update_context(
        order_status: status,
        last_updated: Time.now,
        status_notes: notes
      )

      { success: true, new_status: status }
    end

    [@order_processor, @inventory_manager, @payment_processor, @fulfillment_agent].each do |agent|
      agent.add_tool(update_order_status)
    end
  end
end

4.5. Data Flow Between Agents

class ContentCreationPipeline
  def initialize
    # Each agent enriches the data for the next stage
    @researcher = create_researcher
    @outline_creator = create_outline_creator  
    @content_writer = create_content_writer
    @seo_optimizer = create_seo_optimizer
    @publisher = create_publisher

    setup_data_flow
  end

  def create_content(topic, target_audience, content_type)
    initial_brief = {
      topic: topic,
      target_audience: target_audience,
      content_type: content_type,
      creation_date: Date.today,
      workflow_stage: 'research'
    }

    runner = RAAF::Runner.new(
      agent: @researcher,
      agents: [@researcher, @outline_creator, @content_writer, @seo_optimizer, @publisher],
      context_variables: initial_brief
    )

    runner.run("Create #{content_type} content about #{topic} for #{target_audience}")
  end

  private

  def create_researcher
    agent = RAAF::Agent.new(
      name: "ContentResearcher",
      instructions: "Research topics and gather supporting information",
      model: "gpt-4o"
    )

    # Tool to store research findings
    agent.add_tool(lambda do |research_data:, sources: []|
      runner.update_context(
        research_findings: research_data,
        sources: sources,
        research_completed_at: Time.now,
        workflow_stage: 'outlining'
      )

      { status: 'research_complete', findings_count: research_data.length }
    end)

    agent.add_handoff(@outline_creator)
    agent
  end

  def create_outline_creator
    agent = RAAF::Agent.new(
      name: "OutlineCreator",
      instructions: <<~INSTRUCTIONS,
        Create detailed content outlines based on research.
        Use research_findings from context to structure content.
      INSTRUCTIONS
      model: "gpt-4o"
    )

    agent.add_tool(lambda do |outline:, key_points: []|
      runner.update_context(
        content_outline: outline,
        key_points: key_points,
        outline_completed_at: Time.now,
        workflow_stage: 'writing'
      )

      { status: 'outline_complete', sections: outline.length }
    end)

    agent.add_handoff(@content_writer)
    agent
  end

  def create_content_writer
    agent = RAAF::Agent.new(
      name: "ContentWriter",
      instructions: <<~INSTRUCTIONS,
        Write engaging content following the outline.
        Use content_outline and research_findings from context.
      INSTRUCTIONS
      model: "gpt-4o"
    )

    agent.add_tool(lambda do |content:, word_count:|
      runner.update_context(
        final_content: content,
        word_count: word_count,
        content_completed_at: Time.now,
        workflow_stage: 'seo_optimization'
      )

      { status: 'content_complete', word_count: word_count }
    end)

    agent.add_handoff(@seo_optimizer)
    agent
  end
end

5. Error Handling and Recovery

5.1. The Christmas Eve Meltdown (And How We Survived It)

December 24th, 6 PM. Peak holiday shopping. Our AI-powered customer service handling 10,000 conversations per minute.

Then OpenAI went down.

In the old days, this would have been game over. Every customer would see "Service Unavailable." Stock would plummet. Christmas ruined.

But we had learned from pain. Our system gracefully degraded:

  1. Primary agents failed → Switched to Claude backup agents (2 seconds)
  2. Claude failed → Activated local Llama models (5 seconds)
  3. Complex queries → Routed to human agents with AI-prepared context
  4. Simple queries → Handled by rule-based fallbacks

Result: 94% of customers never knew there was a problem. The 6% who experienced delays got personalized apologies with discount codes.

5.2. Why Traditional Error Handling Fails with AI

Traditional Software: Error → Log it → Return error message → Done

AI Systems: Error → Cascade failure → Context lost → Conversation ruined → Customer gone

AI errors are different because:

  • State is complex: Losing context means starting over
  • Fallbacks aren't obvious: Can't just "return null"
  • User expectations are high: People expect conversation continuity
  • Costs compound: Retrying expensive operations multiplies costs

5.3. The Three Pillars of AI Resilience

  1. Graceful Degradation: Always have a Plan B (and C, and D)
  2. Circuit Breakers: Stop cascading failures before they spread
  3. Context Preservation: Never lose what you've learned

5.4. Graceful Degradation

class ResilientWorkflow
  def initialize
    @primary_agents = create_primary_agents
    @fallback_agents = create_fallback_agents
    @error_handler = create_error_handler
  end

  def execute_with_fallbacks(task)
    begin
      # Try primary workflow
      execute_primary_workflow(task)
    rescue RAAF::Errors::AgentExecutionError => e
      log_error(e, 'primary_workflow_failed')

      # Attempt recovery with fallback agents
      execute_fallback_workflow(task, e)
    rescue => e
      # Ultimate fallback - human escalation
      escalate_to_human(task, e)
    end
  end

  private

  def execute_primary_workflow(task)
    runner = RAAF::Runner.new(
      agent: @primary_agents[:coordinator],
      agents: @primary_agents.values,
      error_strategy: :retry_with_backoff,
      max_retries: 3
    )

    runner.run(task)
  end

  def execute_fallback_workflow(task, original_error)
    # Use simpler agents or different models
    fallback_runner = RAAF::Runner.new(
      agent: @fallback_agents[:simple_coordinator],
      agents: @fallback_agents.values,
      context_variables: {
        fallback_mode: true,
        original_error: original_error.message
      }
    )

    fallback_runner.run("Simplified version: #{task}")
  end

  def create_fallback_agents
    {
      simple_coordinator: RAAF::Agent.new(
        name: "SimpleCoordinator",
        instructions: "Handle tasks with basic capabilities only",
        model: "gpt-4o-mini"  # Use cheaper, faster model
      )
    }
  end

  def escalate_to_human(task, error)
    {
      status: 'human_escalation_required',
      task: task,
      error: error.message,
      escalation_id: SecureRandom.uuid,
      escalated_at: Time.now
    }
  end
end

5.5. Circuit Breaker Pattern

class CircuitBreakerWorkflow
  def initialize
    @circuit_breakers = {}
    @failure_thresholds = {
      api_agent: 5,        # 5 failures in 10 minutes
      db_agent: 3,         # 3 failures in 5 minutes  
      external_service: 10  # 10 failures in 30 minutes
    }
  end

  def execute_with_circuit_breaker(agent_name, task)
    circuit_breaker = get_circuit_breaker(agent_name)

    if circuit_breaker[:open] && !circuit_breaker_should_retry?(circuit_breaker)
      return {
        status: 'circuit_breaker_open',
        message: "#{agent_name} is temporarily unavailable",
        retry_after: circuit_breaker[:retry_after]
      }
    end

    begin
      result = execute_agent_task(agent_name, task)
      reset_circuit_breaker(agent_name)
      result
    rescue => e
      record_failure(agent_name, e)

      if should_open_circuit?(agent_name)
        open_circuit_breaker(agent_name)
      end

      raise e
    end
  end

  private

  def get_circuit_breaker(agent_name)
    @circuit_breakers[agent_name] ||= {
      failures: [],
      open: false,
      opened_at: nil,
      retry_after: nil
    }
  end

  def should_open_circuit?(agent_name)
    circuit_breaker = @circuit_breakers[agent_name]
    threshold = @failure_thresholds[agent_name] || 5

    recent_failures = circuit_breaker[:failures].select do |failure_time|
      Time.now - failure_time < 10.minutes
    end

    recent_failures.count >= threshold
  end

  def open_circuit_breaker(agent_name)
    @circuit_breakers[agent_name].merge!(
      open: true,
      opened_at: Time.now,
      retry_after: Time.now + 5.minutes
    )

    log_circuit_breaker_opened(agent_name)
  end
end

6. Performance Optimization

6.1. Scaling Multi-Agent Systems

Black Friday 2023. We expected 10x normal traffic. We got 50x.

Our single-agent-per-request architecture hit a wall. Response times went from 2 seconds to 2 minutes. The queue backed up. Customers started abandoning carts worth $2.3M.

Solution: Implement agent pooling to handle concurrent requests efficiently.

After implementing agent pools, the results were:

  • Response time: 2 minutes → 3 seconds
  • Throughput: 100 requests/min → 5,000 requests/min
  • Customer satisfaction: 45% → 94%
  • Revenue saved: $2.3M

6.2. Why AI Performance Is Different from Traditional Scaling

Traditional App Scaling: Add more servers, problem solved

AI Scaling Challenges:

  • Token limits: Can't just "add more memory"
  • API rate limits: Providers throttle you
  • Context overhead: More agents = more context to manage
  • Cost multiplication: 10x scale = 10x API costs
  • Quality degradation: Rushed agents make more mistakes

6.3. Scalable Architecture Patterns

  1. Agent Pools: Like connection pools, but for AI
  2. Smart Routing: Send simple tasks to fast/cheap agents
  3. Batch Processing: Group similar requests
  4. Cache Everything: Especially expensive operations

6.4. Parallel Agent Pools

class HighThroughputProcessor
  def initialize(pool_size: 10)
    @agent_pools = {
      text_processor: create_agent_pool('TextProcessor', pool_size),
      data_analyzer: create_agent_pool('DataAnalyzer', pool_size),
      report_generator: create_agent_pool('ReportGenerator', pool_size)
    }
  end

  def process_batch(tasks)
    # Group tasks by type
    grouped_tasks = tasks.group_by { |task| task[:type] }

    # Process each group in parallel
    futures = grouped_tasks.map do |task_type, task_list|
      Concurrent::Future.execute do
        process_task_group(task_type, task_list)
      end
    end

    # Collect all results
    futures.map(&:value).flatten
  end

  private

  def create_agent_pool(agent_type, size)
    ConnectionPool.new(size: size, timeout: 30) do
      case agent_type
      when 'TextProcessor'
        RAAF::Agent.new(
          name: "TextProcessor_#{SecureRandom.hex(4)}",
          instructions: "Process and analyze text content",
          model: "gpt-4o-mini"  # Use faster model for bulk processing
        )
      when 'DataAnalyzer'
        RAAF::Agent.new(
          name: "DataAnalyzer_#{SecureRandom.hex(4)}", 
          instructions: "Analyze data and extract insights",
          model: "gpt-4o"
        )
      when 'ReportGenerator'
        RAAF::Agent.new(
          name: "ReportGenerator_#{SecureRandom.hex(4)}",
          instructions: "Generate formatted reports",
          model: "gpt-4o"
        )
      end
    end
  end

  def process_task_group(task_type, tasks)
    pool = @agent_pools[task_type.downcase.to_sym]

    # Process tasks in parallel using the pool
    Concurrent::Array.new.tap do |results|
      Concurrent::ThreadPoolExecutor.new.tap do |executor|
        tasks.each do |task|
          executor.post do
            pool.with do |agent|
              runner = RAAF::Runner.new(agent: agent)
              result = runner.run(task[:content])
              results << { task_id: task[:id], result: result }
            end
          end
        end

        executor.shutdown
        executor.wait_for_termination
      end
    end
  end
end

6.5. Performance and Caching

For multi-agent system optimization, see: * Performance Guide - Caching strategies, connection pooling, and optimization * Cost Management Guide - Managing costs in multi-agent workflows runner.run(request[:message]) end end ```

7. Advanced Orchestration Patterns

7.1. Event-Driven Agent Coordination

Event-driven architecture addresses coordination challenges in multi-agent systems by using asynchronous event publishing and subscription patterns.

Coordination challenges without events:

  • Agents lack visibility into other agents' state changes
  • Direct agent-to-agent communication creates tight coupling
  • Race conditions occur when multiple agents modify shared state
  • Error handling becomes complex with synchronous dependencies

Event-driven solution:

  • Order Agent publishes "ORDER_CREATED: #123" event
  • Inventory Agent subscribes to order events and reserves stock
  • Payment Agent subscribes to order events and processes payment
  • Shipping Agent subscribes to payment events and creates labels

This approach decouples agents and enables reliable, scalable coordination.

7.2. Why Event-Driven Beats Direct Communication

Traditional: A calls B, B calls C, C calls D... and if B fails, everything stops.

Event-Driven: A publishes event. B, C, and D all react independently. If B fails, C and D keep working.

It's like the difference between a phone tree and a group chat. One is fragile and sequential. The other is robust and parallel.

7.3. Event-Driven Architecture

class EventDrivenAgentSystem
  def initialize
    @event_bus = EventBus.new
    @agents = {}
    @subscriptions = {}

    setup_agents_and_subscriptions
  end

  def publish_event(event_type, data)
    @event_bus.publish(event_type, data)
  end

  private

  def setup_agents_and_subscriptions
    # Create agents
    @agents[:order_processor] = create_order_processor
    @agents[:inventory_manager] = create_inventory_manager
    @agents[:notification_service] = create_notification_service

    # Set up event subscriptions
    @event_bus.subscribe('order.created') do |event_data|
      process_new_order(event_data)
    end

    @event_bus.subscribe('inventory.low') do |event_data|
      handle_low_inventory(event_data)
    end

    @event_bus.subscribe('order.shipped') do |event_data|
      send_shipping_notification(event_data)
    end
  end

  def process_new_order(order_data)
    runner = RAAF::Runner.new(agent: @agents[:order_processor])
    result = runner.run("Process new order: #{order_data[:order_id]}")

    if result.success?
      # Trigger next event in the chain
      @event_bus.publish('order.processed', {
        order_id: order_data[:order_id],
        processed_at: Time.now
      })
    end
  end

  def handle_low_inventory(inventory_data)
    runner = RAAF::Runner.new(agent: @agents[:inventory_manager])
    runner.run("Handle low inventory for product: #{inventory_data[:product_id]}")
  end

  def send_shipping_notification(shipping_data)
    runner = RAAF::Runner.new(agent: @agents[:notification_service])
    runner.run("Send shipping notification for order: #{shipping_data[:order_id]}")
  end
end

7.4. State Machine Workflows

EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance. Error: NameError: uninitialized constant StateMachineWorkflow::AASM /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-4y1o33.rb:445:in '<class:StateMachineWorkflow>' /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-4y1o33.rb:444:in '<main>'

class StateMachineWorkflow
  include AASM

  aasm do
    state :waiting_for_input, initial: true
    state :analyzing
    state :processing
    state :reviewing
    state :completed
    state :failed

    event :start_analysis do
      transitions from: :waiting_for_input, to: :analyzing
    end

    event :begin_processing do
      transitions from: :analyzing, to: :processing
    end

    event :request_review do
      transitions from: :processing, to: :reviewing
    end

    event :complete do
      transitions from: :reviewing, to: :completed
    end

    event :fail do
      transitions from: [:analyzing, :processing, :reviewing], to: :failed
    end
  end

  def initialize(task_data)
    @task_data = task_data
    @agents = create_specialized_agents
    @context = { task_id: SecureRandom.uuid }
  end

  def execute
    start_analysis!

    case aasm_state
    when 'analyzing'
      perform_analysis
    when 'processing'  
      perform_processing
    when 'reviewing'
      perform_review
    when 'completed'
      return_results
    when 'failed'
      handle_failure
    end
  end

  private

  def perform_analysis
    begin
      runner = RAAF::Runner.new(
        agent: @agents[:analyst],
        context_variables: @context
      )

      result = runner.run("Analyze: #{@task_data}")

      if result.success?
        @context[:analysis_result] = result.messages.last[:content]
        begin_processing!
        execute  # Continue to next state
      else
        fail!
      end
    rescue => e
      @context[:error] = e.message
      fail!
    end
  end

  def perform_processing
    begin
      runner = RAAF::Runner.new(
        agent: @agents[:processor],
        context_variables: @context
      )

      result = runner.run("Process based on analysis")

      if result.success?
        @context[:processing_result] = result.messages.last[:content]
        request_review!
        execute  # Continue to next state
      else
        fail!
      end
    rescue => e
      @context[:error] = e.message
      fail!
    end
  end
end

8. Testing Multi-Agent Systems

8.1. Critical Importance of Multi-Agent Testing

Multi-agent systems can exhibit emergent behaviors that don't appear in individual agent testing. Production failures often result from unexpected agent interactions.

Common production failure modes:

  • Research agents producing unreliable source information
  • Writer agents ignoring provided research context
  • Editor agents approving content without proper verification
  • Cascading failures where one agent's errors propagate through the system

Risk mitigation: Comprehensive testing must include agent interaction patterns, not just individual agent functionality.

8.2. Integration Testing for Multi-Agent Systems

Multi-agent systems require integration testing that validates agent interactions and workflow coordination.

Individual agent testing limitations: Testing agents in isolation doesn't reveal coordination issues, context transfer problems, or emergent behaviors.

Integration testing requirements: Validate complete workflows, agent handoff accuracy, and error propagation handling.

8.3. The Three Levels of Multi-Agent Testing

  1. Unit Tests: Each agent in isolation
  2. Integration Tests: Agents talking to each other
  3. Scenario Tests: Complete workflows with edge cases

8.4. Unit Testing Individual Agents

EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance. Error: NameError: undefined local variable or method 'create_research_agent' for main /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-kpq18i.rb:445:in 'block (2 levels) in <main>' /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-kpq18i.rb:451:in 'block (3 levels) in <main>' /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-kpq18i.rb:334:in 'Object#it'

RSpec.describe 'Multi-Agent Research Workflow' do
  let(:research_agent) { create_research_agent }
  let(:writer_agent) { create_writer_agent }
  let(:editor_agent) { create_editor_agent }

  describe 'Research Agent' do
    it 'gathers relevant information' do
      runner = RAAF::Runner.new(agent: research_agent)
      result = runner.run("Research renewable energy technologies")

      expect(result.success?).to be true
      expect(result.messages.last[:content]).to include('renewable energy')

      # Verify handoff intent
      expect(result.handoff_requested?).to be true
      expect(result.handoff_target).to eq('Writer')
    end
  end

  describe 'Writer Agent' do
    it 'creates content from research' do
      # Set up context with research data
      context = {
        research_findings: "Solar and wind power are leading renewable technologies...",
        target_audience: "general public"
      }

      runner = RAAF::Runner.new(
        agent: writer_agent,
        context_variables: context
      )

      result = runner.run("Write an article based on the research")

      expect(result.success?).to be true
      expect(result.messages.last[:content]).to include('renewable')
    end
  end
end

8.5. Integration Testing Workflows

EXAMPLE VALIDATION FAILED - This example needs work and contributions are welcome! Please see Contributing to RAAF for guidance. Error: NameError: uninitialized constant ContentCreationWorkflow /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-1lhcn7.rb:445:in 'block (2 levels) in <main>' /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-1lhcn7.rb:448:in 'block (2 levels) in <main>' /var/folders/r5/1t1h14ts04v5plm6tg1237pr0000gn/T/code_block20250725-12953-1lhcn7.rb:334:in 'Object#it'

RSpec.describe 'Complete Content Creation Workflow' do
  let(:workflow) { ContentCreationWorkflow.new }

  it 'completes the full research → write → edit pipeline' do
    result = workflow.create_content(
      topic: "Climate Change Solutions",
      target_audience: "business leaders",
      content_type: "blog post"
    )

    expect(result.success?).to be true
    expect(result.final_stage).to eq('completed')

    # Verify each stage was executed
    expect(result.execution_log).to include(
      { stage: 'research', agent: 'ContentResearcher', status: 'completed' },
      { stage: 'writing', agent: 'ContentWriter', status: 'completed' },
      { stage: 'editing', agent: 'Editor', status: 'completed' }
    )

    # Verify content quality
    final_content = result.final_output
    expect(final_content).to include('climate change')
    expect(final_content.length).to be > 500  # Minimum length
  end

  it 'handles failures gracefully' do
    # Simulate agent failure
    allow_any_instance_of(RAAF::Agent).to receive(:run).and_raise(StandardError)

    result = workflow.create_content(
      topic: "Test Topic",
      target_audience: "test audience", 
      content_type: "article"
    )

    expect(result.success?).to be false
    expect(result.error_stage).to be_present
    expect(result.fallback_used?).to be true
  end
end

8.6. Testing Multi-Agent Systems

For comprehensive testing guidance including performance testing, see: * Testing Guide - Testing patterns for multi-agent systems * Performance Guide - Performance testing and benchmarking

9. Best Practices

9.1. Multi-Agent Design Principles

Production-ready multi-agent systems require adherence to established architectural patterns and design principles.

9.2. Single Responsibility Principle

Agents should have clearly defined, focused responsibilities to ensure predictable behavior and maintainability.

Poor agent design: "This agent handles customer service, billing, technical support, and product recommendations."

Effective agent design: "This agent answers product questions within defined parameters."

Focused agents provide better results and easier debugging. When failures occur, the source is immediately identifiable, reducing diagnostic time and system complexity.

9.3. Design Principles That Actually Matter

  1. Single Responsibility - Each agent does ONE thing excellently

    • CustomerGreeting agent: Says hello, assesses needs, routes
    • OrderLookup agent: Finds orders, nothing else
    • RefundProcessor agent: Processes refunds, full stop
  2. Loose Coupling - Agents communicate through events, not direct calls

    • Like email vs. walking to someone's desk
    • If Agent B is down, Agent A doesn't crash
  3. Idempotency - Running twice = same result

    • Critical for retries and error recovery
    • "Process order 123" should be safe to call multiple times
  4. Observability - You can't fix what you can't see

    • Every handoff logged
    • Every decision traceable
    • Every error actionable
  5. Graceful Degradation - Fail partially, not completely

    • Primary specialist unavailable? Use generalist
    • Complex analysis fails? Provide basic response
    • Always better than "Service Unavailable"

9.4. Common Patterns to Avoid

# ❌ BAD: Tightly coupled agents
class BadAgent < RAAF::Agent
  def initialize(other_agent)
    @other_agent = other_agent  # Direct dependency
    super(name: "BadAgent", instructions: "...", model: "gpt-4o")
  end

  def custom_method
    # Directly calling other agent
    @other_agent.run("Some message")  # Tight coupling
  end
end

# ✅ GOOD: Loosely coupled through handoffs
class GoodAgent < RAAF::Agent
  def initialize
    super(name: "GoodAgent", instructions: "...", model: "gpt-4o")

    # Define handoff conditions, not direct dependencies
    add_handoff_condition do |context, messages|
      if should_escalate?(messages.last[:content])
        { agent: :specialist_agent, reason: 'Escalation needed' }
      end
    end
  end
end

9.5. Monitoring and Observability

class MonitoredWorkflow
  def initialize
    @tracer = RAAF::Tracing::SpanTracer.new
    @tracer.add_processor(RAAF::Tracing::ConsoleProcessor.new)
    @tracer.add_processor(RAAF::Tracing::OpenAIProcessor.new)

    @metrics = setup_metrics_collection
  end

  def execute_with_monitoring(task)
    start_time = Time.now

    begin
      result = execute_workflow(task)

      # Record success metrics
      @metrics.increment('workflow.success')
      @metrics.timing('workflow.duration', Time.now - start_time)

      result
    rescue => e
      # Record failure metrics
      @metrics.increment('workflow.failure')
      @metrics.increment("workflow.failure.#{e.class.name.downcase}")

      # Log error details
      @tracer.record_error(e, {
        task: task,
        duration: Time.now - start_time
      })

      raise e
    end
  end

  private

  def setup_metrics_collection
    # Configure your metrics backend (StatsD, Prometheus, etc.)
    StatsD.new('localhost', 8125)
  end
end

10. Next Steps

10.1. From Chaos to Symphony: Your Multi-Agent Journey

Remember where we started? UltraBot 3000 having an existential crisis, agents stepping on each other's toes, and $100K bugs?

Look where you are now. You understand:

  • Why specialization beats generalization
  • How to orchestrate agents like a conductor
  • When to use sequential vs. parallel patterns
  • How to handle failures gracefully
  • Why testing saves your reputation (and sleep)

10.2. Your Next Adventures in Multi-Agent Mastery

10.3. One Final Story

Last month, our biggest client called. They needed a system to handle their entire customer journey—from first contact through support, sales, fulfillment, and follow-up. "Can your AI do all that?" they asked.

"No," we said. "But our 12 specialized agents working together can."

The production system handles 50,000 conversations daily. Each agent handles specific responsibilities with high accuracy. The coordinated system provides comprehensive functionality.

Multi-agent systems demonstrate the power of specialized coordination over monolithic approaches.

Now go build your own agent dream team. And remember: when in doubt, specialize.



Back to top