lastmile-ai/mcp-agent

Reliable Conversation Manager (RCM) - Implementation Status & Architecture

View on GitHub ↗Yours? Claim it ↗

§ 01 — Stats

Stars8.3k

Forks837

Prior1353

Quality—

Score—

Tasks—

§ 02 — Use

Drop into your project.

A CLAUDE.md is just a markdown file at the root of your repo. Copy the content below into your own project's CLAUDE.md to give your agent the same context.

One-line install · current directory

$npx versuz@latest install lastmile-ai-mcp-agent --kind=claude-md

Or curl directly

$curl -o CLAUDE.md https://raw.githubusercontent.com/lastmile-ai/mcp-agent/HEAD/CLAUDE.md

Project typepython-data

Tokens

Embed badge

Show

Style

[![Versuz · lastmile-ai/mcp-agent](https://versuz.dev/badge/claude-md/lastmile-ai-mcp-agent)](https://versuz.dev/claude-md/lastmile-ai-mcp-agent)

Show CLAUDE.md content (~3.7k tokens)

# Reliable Conversation Manager (RCM) - Implementation Status & Architecture

## Executive Summary

The Reliable Conversation Manager (RCM) is a production-ready mcp-agent application that implements research findings from "LLMs Get Lost in Multi-Turn Conversation" to create more reliable multi-turn conversational AI systems. This document describes the current implementation status, architecture, and planned enhancements.

### Core Design Principles

1. **Conversation-as-Workflow**: The entire conversation is a single workflow instance, NOT individual turns
2. **Quality-First**: Every response undergoes mandatory quality evaluation and potential refinement
3. **Fail-Fast**: Detect quality issues early and fix them before they compound
4. **Observable**: Every decision point is logged and traceable
5. **Testable**: Components are isolated with clear interfaces

## Architecture Decisions

### Why mcp-agent?

The mcp-agent framework provides critical abstractions that align perfectly with RCM requirements:

```python
# From examples/basic/mcp_basic_agent/main.py - canonical agent pattern
async with finder_agent:
    logger.info("finder: Connected to server, calling list_tools...")
    result = await finder_agent.list_tools()
    llm = await finder_agent.attach_llm(OpenAIAugmentedLLM)
```

**Decision**: Use mcp-agent's Agent abstraction for ALL LLM interactions, including quality evaluation. This ensures consistent tool access, logging, and error handling.

### Workflow Architecture Pattern

Based on analysis of mcp-agent examples, there are two patterns:

1. **Turn-as-Workflow** (REJECTED):

```python
# From original design doc - this neutralizes Temporal benefits
@app.workflow
class TurnProcessorWorkflow(Workflow[Dict[str, Any]]):
    async def run(self, args: Dict[str, Any]) -> WorkflowResult[Dict[str, Any]]:
        # Process one turn... loses conversation state
```

2. **Conversation-as-Workflow** (ADOPTED):

```python
# From examples/mcp_agent_server/temporal/basic_agent_server.py - pattern we'll extend
@app.workflow
class BasicAgentWorkflow(Workflow[str]):
    @app.workflow_run
    async def run(self, input: str = "What is the Model Context Protocol?") -> WorkflowResult[str]:
        # Maintains state across entire conversation
```

**Decision**: Implement conversation-as-workflow with internal state management and user input waiting.

### Quality Control Architecture

The paper identifies four key failure modes:

1. **Premature Answer Attempts** (39% of failures)
2. **Answer Bloat** (20-300% length increase)
3. **Lost-in-Middle-Turns** (forget middle context)
4. **Unreliability** (112% increase in multi-turn)

**Decision**: Implement mandatory quality pipeline with LLM-as-judge pattern:

```python
# Based on paper's quality dimensions
quality_dimensions = {
    "clarity": "Clear, well-structured response",
    "completeness": "Addresses all user requirements",
    "assumptions": "Minimizes unsupported assumptions (LOWER IS BETTER)",
    "verbosity": "Concise without bloat (LOWER IS BETTER)",
    "premature_attempt": "Boolean - attempted answer without info",
    "middle_turn_reference": "References information from middle turns",
    "requirement_tracking": "Tracks user requirements across turns"
}
```

## Implementation Status

### ✅ **FULLY IMPLEMENTED (Production Ready)**

- **Complete Quality Control Pipeline**: 7-dimension LLM evaluation with refinement loops working in production
- **Research-Based Data Models**: All conversation models with state persistence and serialization
- **AsyncIO Workflow**: Production REPL with rich formatting and real-time progress reporting
- **Requirement Tracking**: Cross-turn requirement extraction and status management
- **Context Consolidation**: Prevents lost-in-middle-turns (every 3 turns by default)
- **Robust Fallback System**: Comprehensive heuristic fallbacks when LLM providers unavailable
- **Comprehensive Testing**: Automated 3-turn conversation tests with detailed validation
- **Research Metrics**: Answer bloat tracking, premature attempt detection, quality trend analysis
- **Rich REPL Interface**: Interactive commands (/stats, /requirements, /config, /exit) with enhanced formatting
- **Real LLM Integration**: Works with OpenAI and Anthropic APIs via mcp-agent patterns

### 🔄 **PLANNED ENHANCEMENTS**

- **Temporal Workflow Support**: Long-running conversation support (Phase 6 planned)
- **Specialized Task Handlers**: Code vs chat distinction with Claude Code SDK integration
- **Advanced MCP Patterns**: Sophisticated tool selection and usage patterns

## Current Architecture

### File Structure
```
examples/reliable_conversation/
├── src/
│   ├── workflows/
│   │   └── conversation_workflow.py    # Main AsyncIO workflow (Temporal ready)
│   ├── models/
│   │   └── conversation_models.py      # Research-based data models
│   ├── tasks/
│   │   ├── task_functions.py           # Core quality control orchestration
│   │   ├── llm_evaluators.py          # LLM evaluation with fallbacks
│   │   ├── quality_control.py         # Quality pipeline coordination
│   │   └── task_registry.py           # Task registration utilities
│   └── utils/
│       ├── logging.py                  # Enhanced logging with conversation context
│       ├── config.py                   # Configuration management
│       ├── test_runner.py              # Test framework with rich output
│       ├── progress_reporter.py        # Real-time progress display
│       └── readable_output.py          # Rich console formatting
├── main.py                             # Production REPL interface
├── test_basic.py                       # Comprehensive automated tests
├── app.py                              # Alternative entry point
├── workflow.py                         # Legacy (use src/workflows/ instead)
└── mcp_agent.config.yaml              # Complete configuration
```

### Core Data Models

The system implements all research-based models with full serialization support:

```python
@dataclass
class ConversationMessage:
    """Single message in conversation - matches paper's Message model"""
    role: Literal["user", "assistant", "system"]
    content: str
    timestamp: datetime = field(default_factory=datetime.utcnow)
    turn_number: int = 0

@dataclass
class QualityMetrics:
    """From paper Table 1 - all metrics 0-1 scale"""
    clarity: float
    completeness: float
    assumptions: float  # Lower is better
    verbosity: float    # Lower is better
    premature_attempt: bool = False
    middle_turn_reference: float = 0.0
    requirement_tracking: float = 0.0

    @property
    def overall_score(self) -> float:
        """Paper's composite scoring formula"""
        base = (self.clarity + self.completeness + self.middle_turn_reference +
                self.requirement_tracking + (1 - self.assumptions) + (1 - self.verbosity)) / 6
        if self.premature_attempt:
            base *= 0.5  # Heavy penalty from paper
        return base
```

### Quality Control Implementation

**Current Implementation Pattern:**
```python
# task_functions.py - Direct function calls with comprehensive fallbacks
async def process_turn_with_quality(params):
    """Main orchestration function implementing paper's quality methodology"""
    requirements = await extract_requirements_with_llm(...)  # + heuristic fallback
    context = await consolidate_context_with_llm(...)        # + size-based fallback  
    response = await generate_response_with_constraints(...) # + simple generation
    metrics = await evaluate_quality_with_llm(...)          # + heuristic scoring
    return refined_response_if_needed

async def evaluate_quality_with_llm(params):
    """7-dimension quality evaluation with robust fallbacks"""
    try:
        # Real LLM evaluation with research-based prompt
        evaluation = await llm.generate_str(quality_prompt)
        return parse_quality_metrics(evaluation)
    except Exception:
        # Comprehensive heuristic fallback system
        return calculate_fallback_quality_metrics(params)
```

**Key Features:**
- Uses direct async function calls rather than decorators for simplicity
- All functions include comprehensive heuristic fallbacks
- Quality evaluation supports both LLM and fallback scoring  
- Response refinement loop with configurable attempts (default 3)
- Context consolidation every N turns (default 3) to prevent lost-in-middle

## Working Examples

### Automated Testing
```bash
# Run comprehensive 3-turn conversation test with validation
python test_basic.py
# Features tested:
# - Multi-turn state persistence and requirement tracking
# - Quality control pipeline with real LLM calls + fallbacks
# - Context consolidation triggering (turn 3)
# - Research metrics collection (bloat ratios, premature attempts)
# - Rich console output with detailed analysis
```

### Interactive REPL
```bash
python main.py
# Try a multi-turn coding request to see quality control in action
> I need help creating a Python function that handles file uploads
> Actually, it should also validate file types for security
> Can you add error handling for large files too?
> /stats  # Shows answer bloat ratio, quality scores, requirements
> /requirements  # Shows tracked requirements across turns
> /config  # Shows runtime configuration
```

### Configuration
```yaml
# mcp_agent.config.yaml - working production configuration
execution_engine: asyncio

rcm:
  quality_threshold: 0.8              # Minimum quality score for responses
  max_refinement_attempts: 3          # Max response refinement iterations  
  consolidation_interval: 3           # Context consolidation frequency (every N turns)
  evaluator_model_provider: "openai"  # LLM provider for quality evaluation
  verbose_metrics: false              # Show detailed quality metrics in REPL

# mcp_agent.secrets.yaml - API key configuration  
openai:
  api_key: "your-openai-api-key-here"
anthropic:
  api_key: "your-anthropic-api-key-here"
```

**Note**: The system includes comprehensive fallbacks that work without API keys for testing.

## Implementation Status by Phase

### ✅ **Phase 1-2: Foundation & Quality Control** (COMPLETE)
- Core workflow with AsyncIO support ✅
- Complete data models with serialization ✅  
- 7-dimension quality evaluation system ✅
- Requirement tracking and extraction ✅
- Context consolidation ✅
- Robust fallback systems ✅

### ✅ **Phase 4-5: Integration & Testing** (COMPLETE)
- Quality refinement loops ✅
- Rich REPL with commands (/stats, /requirements, /config) ✅
- Comprehensive test suite ✅
- Real LLM integration with fallbacks ✅
- Research metrics tracking (answer bloat, premature attempts) ✅

### 🔄 **Phase 3: Task Handlers** (PLANNED)
- Specialized code vs chat handling
- Claude Code SDK integration  
- Advanced MCP tool patterns

### 🔄 **Phase 6: Temporal Migration** (PLANNED)  
- Long-running conversation support
- Signal handling for pause/resume
- Production deployment patterns

## Research Implementation Features

### Paper Findings Implementation

**1. Premature Answer Prevention (39% of failures)**
- ✅ **Implemented**: Detects completion markers and pending requirements
- ✅ **Working**: Prevents responses until sufficient information gathered  
- ✅ **Quality evaluation**: Includes premature attempt scoring with penalty

**2. Answer Bloat Prevention (20-300% length increase)**
- ✅ **Implemented**: Tracks response length ratios across turns
- ✅ **Working**: Verbosity scoring in quality metrics
- ✅ **Real-time tracking**: Answer bloat ratios shown in `/stats` command

**3. Lost-in-Middle-Turns Prevention**
- ✅ **Implemented**: Context consolidation every 3 turns by default
- ✅ **Working**: Explicit middle-turn reference tracking in quality metrics
- ✅ **Research validation**: Shows context consolidation in test suite

**4. Instruction Forgetting Prevention**
- ✅ **Implemented**: Cross-turn requirement tracking with status management
- ✅ **Working**: LLM-based requirement extraction with heuristic fallbacks
- ✅ **Persistent state**: Complete conversation state maintained across turns

### Quality Control Pipeline

**7-Dimension Evaluation System (All Working):**
1. **Clarity** (0-1): Response structure and comprehensibility
2. **Completeness** (0-1): Requirements coverage  
3. **Assumptions** (0-1, lower better): Unsupported assumptions
4. **Verbosity** (0-1, lower better): Response bloat detection
5. **Premature Attempt** (boolean): Complete solution without sufficient info
6. **Middle Turn Reference** (0-1): References to middle conversation turns
7. **Requirement Tracking** (0-1): Cross-turn requirement awareness

**Refinement Loop**: Responses below quality threshold automatically refined up to 3 attempts (configurable)

## Current Status vs Planned

**✅ PRODUCTION READY (Significantly exceeds typical research prototypes):**
- Complete implementation of all paper findings
- Robust fallback systems at every level
- Rich user experience with real-time progress and metrics
- Comprehensive test suite with automated validation
- Works with real LLM APIs (OpenAI/Anthropic) plus full offline mode

**🔄 ENHANCEMENT OPPORTUNITIES:**
- Temporal workflow support for long-running conversations
- Specialized task handlers (code vs chat distinction)
- Advanced MCP tool selection patterns
- Additional research metric visualizations

The implementation is **production-ready** and demonstrates sophisticated quality control based on research findings, not just a proof-of-concept.