A CLAUDE.md is just a markdown file at the root of your repo. Copy the content below into your own project's CLAUDE.md to give your agent the same context.
npx versuz@latest install lastmile-ai-mcp-agent --kind=claude-mdcurl -o CLAUDE.md https://raw.githubusercontent.com/lastmile-ai/mcp-agent/HEAD/CLAUDE.md# Reliable Conversation Manager (RCM) - Implementation Status & Architecture
## Executive Summary
The Reliable Conversation Manager (RCM) is a production-ready mcp-agent application that implements research findings from "LLMs Get Lost in Multi-Turn Conversation" to create more reliable multi-turn conversational AI systems. This document describes the current implementation status, architecture, and planned enhancements.
### Core Design Principles
1. **Conversation-as-Workflow**: The entire conversation is a single workflow instance, NOT individual turns
2. **Quality-First**: Every response undergoes mandatory quality evaluation and potential refinement
3. **Fail-Fast**: Detect quality issues early and fix them before they compound
4. **Observable**: Every decision point is logged and traceable
5. **Testable**: Components are isolated with clear interfaces
## Architecture Decisions
### Why mcp-agent?
The mcp-agent framework provides critical abstractions that align perfectly with RCM requirements:
```python
# From examples/basic/mcp_basic_agent/main.py - canonical agent pattern
async with finder_agent:
logger.info("finder: Connected to server, calling list_tools...")
result = await finder_agent.list_tools()
llm = await finder_agent.attach_llm(OpenAIAugmentedLLM)
```
**Decision**: Use mcp-agent's Agent abstraction for ALL LLM interactions, including quality evaluation. This ensures consistent tool access, logging, and error handling.
### Workflow Architecture Pattern
Based on analysis of mcp-agent examples, there are two patterns:
1. **Turn-as-Workflow** (REJECTED):
```python
# From original design doc - this neutralizes Temporal benefits
@app.workflow
class TurnProcessorWorkflow(Workflow[Dict[str, Any]]):
async def run(self, args: Dict[str, Any]) -> WorkflowResult[Dict[str, Any]]:
# Process one turn... loses conversation state
```
2. **Conversation-as-Workflow** (ADOPTED):
```python
# From examples/mcp_agent_server/temporal/basic_agent_server.py - pattern we'll extend
@app.workflow
class BasicAgentWorkflow(Workflow[str]):
@app.workflow_run
async def run(self, input: str = "What is the Model Context Protocol?") -> WorkflowResult[str]:
# Maintains state across entire conversation
```
**Decision**: Implement conversation-as-workflow with internal state management and user input waiting.
### Quality Control Architecture
The paper identifies four key failure modes:
1. **Premature Answer Attempts** (39% of failures)
2. **Answer Bloat** (20-300% length increase)
3. **Lost-in-Middle-Turns** (forget middle context)
4. **Unreliability** (112% increase in multi-turn)
**Decision**: Implement mandatory quality pipeline with LLM-as-judge pattern:
```python
# Based on paper's quality dimensions
quality_dimensions = {
"clarity": "Clear, well-structured response",
"completeness": "Addresses all user requirements",
"assumptions": "Minimizes unsupported assumptions (LOWER IS BETTER)",
"verbosity": "Concise without bloat (LOWER IS BETTER)",
"premature_attempt": "Boolean - attempted answer without info",
"middle_turn_reference": "References information from middle turns",
"requirement_tracking": "Tracks user requirements across turns"
}
```
## Implementation Status
### ✅ **FULLY IMPLEMENTED (Production Ready)**
- **Complete Quality Control Pipeline**: 7-dimension LLM evaluation with refinement loops working in production
- **Research-Based Data Models**: All conversation models with state persistence and serialization
- **AsyncIO Workflow**: Production REPL with rich formatting and real-time progress reporting
- **Requirement Tracking**: Cross-turn requirement extraction and status management
- **Context Consolidation**: Prevents lost-in-middle-turns (every 3 turns by default)
- **Robust Fallback System**: Comprehensive heuristic fallbacks when LLM providers unavailable
- **Comprehensive Testing**: Automated 3-turn conversation tests with detailed validation
- **Research Metrics**: Answer bloat tracking, premature attempt detection, quality trend analysis
- **Rich REPL Interface**: Interactive commands (/stats, /requirements, /config, /exit) with enhanced formatting
- **Real LLM Integration**: Works with OpenAI and Anthropic APIs via mcp-agent patterns
### 🔄 **PLANNED ENHANCEMENTS**
- **Temporal Workflow Support**: Long-running conversation support (Phase 6 planned)
- **Specialized Task Handlers**: Code vs chat distinction with Claude Code SDK integration
- **Advanced MCP Patterns**: Sophisticated tool selection and usage patterns
## Current Architecture
### File Structure
```
examples/reliable_conversation/
├── src/
│ ├── workflows/
│ │ └── conversation_workflow.py # Main AsyncIO workflow (Temporal ready)
│ ├── models/
│ │ └── conversation_models.py # Research-based data models
│ ├── tasks/
│ │ ├── task_functions.py # Core quality control orchestration
│ │ ├── llm_evaluators.py # LLM evaluation with fallbacks
│ │ ├── quality_control.py # Quality pipeline coordination
│ │ └── task_registry.py # Task registration utilities
│ └── utils/
│ ├── logging.py # Enhanced logging with conversation context
│ ├── config.py # Configuration management
│ ├── test_runner.py # Test framework with rich output
│ ├── progress_reporter.py # Real-time progress display
│ └── readable_output.py # Rich console formatting
├── main.py # Production REPL interface
├── test_basic.py # Comprehensive automated tests
├── app.py # Alternative entry point
├── workflow.py # Legacy (use src/workflows/ instead)
└── mcp_agent.config.yaml # Complete configuration
```
### Core Data Models
The system implements all research-based models with full serialization support:
```python
@dataclass
class ConversationMessage:
"""Single message in conversation - matches paper's Message model"""
role: Literal["user", "assistant", "system"]
content: str
timestamp: datetime = field(default_factory=datetime.utcnow)
turn_number: int = 0
@dataclass
class QualityMetrics:
"""From paper Table 1 - all metrics 0-1 scale"""
clarity: float
completeness: float
assumptions: float # Lower is better
verbosity: float # Lower is better
premature_attempt: bool = False
middle_turn_reference: float = 0.0
requirement_tracking: float = 0.0
@property
def overall_score(self) -> float:
"""Paper's composite scoring formula"""
base = (self.clarity + self.completeness + self.middle_turn_reference +
self.requirement_tracking + (1 - self.assumptions) + (1 - self.verbosity)) / 6
if self.premature_attempt:
base *= 0.5 # Heavy penalty from paper
return base
```
### Quality Control Implementation
**Current Implementation Pattern:**
```python
# task_functions.py - Direct function calls with comprehensive fallbacks
async def process_turn_with_quality(params):
"""Main orchestration function implementing paper's quality methodology"""
requirements = await extract_requirements_with_llm(...) # + heuristic fallback
context = await consolidate_context_with_llm(...) # + size-based fallback
response = await generate_response_with_constraints(...) # + simple generation
metrics = await evaluate_quality_with_llm(...) # + heuristic scoring
return refined_response_if_needed
async def evaluate_quality_with_llm(params):
"""7-dimension quality evaluation with robust fallbacks"""
try:
# Real LLM evaluation with research-based prompt
evaluation = await llm.generate_str(quality_prompt)
return parse_quality_metrics(evaluation)
except Exception:
# Comprehensive heuristic fallback system
return calculate_fallback_quality_metrics(params)
```
**Key Features:**
- Uses direct async function calls rather than decorators for simplicity
- All functions include comprehensive heuristic fallbacks
- Quality evaluation supports both LLM and fallback scoring
- Response refinement loop with configurable attempts (default 3)
- Context consolidation every N turns (default 3) to prevent lost-in-middle
## Working Examples
### Automated Testing
```bash
# Run comprehensive 3-turn conversation test with validation
python test_basic.py
# Features tested:
# - Multi-turn state persistence and requirement tracking
# - Quality control pipeline with real LLM calls + fallbacks
# - Context consolidation triggering (turn 3)
# - Research metrics collection (bloat ratios, premature attempts)
# - Rich console output with detailed analysis
```
### Interactive REPL
```bash
python main.py
# Try a multi-turn coding request to see quality control in action
> I need help creating a Python function that handles file uploads
> Actually, it should also validate file types for security
> Can you add error handling for large files too?
> /stats # Shows answer bloat ratio, quality scores, requirements
> /requirements # Shows tracked requirements across turns
> /config # Shows runtime configuration
```
### Configuration
```yaml
# mcp_agent.config.yaml - working production configuration
execution_engine: asyncio
rcm:
quality_threshold: 0.8 # Minimum quality score for responses
max_refinement_attempts: 3 # Max response refinement iterations
consolidation_interval: 3 # Context consolidation frequency (every N turns)
evaluator_model_provider: "openai" # LLM provider for quality evaluation
verbose_metrics: false # Show detailed quality metrics in REPL
# mcp_agent.secrets.yaml - API key configuration
openai:
api_key: "your-openai-api-key-here"
anthropic:
api_key: "your-anthropic-api-key-here"
```
**Note**: The system includes comprehensive fallbacks that work without API keys for testing.
## Implementation Status by Phase
### ✅ **Phase 1-2: Foundation & Quality Control** (COMPLETE)
- Core workflow with AsyncIO support ✅
- Complete data models with serialization ✅
- 7-dimension quality evaluation system ✅
- Requirement tracking and extraction ✅
- Context consolidation ✅
- Robust fallback systems ✅
### ✅ **Phase 4-5: Integration & Testing** (COMPLETE)
- Quality refinement loops ✅
- Rich REPL with commands (/stats, /requirements, /config) ✅
- Comprehensive test suite ✅
- Real LLM integration with fallbacks ✅
- Research metrics tracking (answer bloat, premature attempts) ✅
### 🔄 **Phase 3: Task Handlers** (PLANNED)
- Specialized code vs chat handling
- Claude Code SDK integration
- Advanced MCP tool patterns
### 🔄 **Phase 6: Temporal Migration** (PLANNED)
- Long-running conversation support
- Signal handling for pause/resume
- Production deployment patterns
## Research Implementation Features
### Paper Findings Implementation
**1. Premature Answer Prevention (39% of failures)**
- ✅ **Implemented**: Detects completion markers and pending requirements
- ✅ **Working**: Prevents responses until sufficient information gathered
- ✅ **Quality evaluation**: Includes premature attempt scoring with penalty
**2. Answer Bloat Prevention (20-300% length increase)**
- ✅ **Implemented**: Tracks response length ratios across turns
- ✅ **Working**: Verbosity scoring in quality metrics
- ✅ **Real-time tracking**: Answer bloat ratios shown in `/stats` command
**3. Lost-in-Middle-Turns Prevention**
- ✅ **Implemented**: Context consolidation every 3 turns by default
- ✅ **Working**: Explicit middle-turn reference tracking in quality metrics
- ✅ **Research validation**: Shows context consolidation in test suite
**4. Instruction Forgetting Prevention**
- ✅ **Implemented**: Cross-turn requirement tracking with status management
- ✅ **Working**: LLM-based requirement extraction with heuristic fallbacks
- ✅ **Persistent state**: Complete conversation state maintained across turns
### Quality Control Pipeline
**7-Dimension Evaluation System (All Working):**
1. **Clarity** (0-1): Response structure and comprehensibility
2. **Completeness** (0-1): Requirements coverage
3. **Assumptions** (0-1, lower better): Unsupported assumptions
4. **Verbosity** (0-1, lower better): Response bloat detection
5. **Premature Attempt** (boolean): Complete solution without sufficient info
6. **Middle Turn Reference** (0-1): References to middle conversation turns
7. **Requirement Tracking** (0-1): Cross-turn requirement awareness
**Refinement Loop**: Responses below quality threshold automatically refined up to 3 attempts (configurable)
## Current Status vs Planned
**✅ PRODUCTION READY (Significantly exceeds typical research prototypes):**
- Complete implementation of all paper findings
- Robust fallback systems at every level
- Rich user experience with real-time progress and metrics
- Comprehensive test suite with automated validation
- Works with real LLM APIs (OpenAI/Anthropic) plus full offline mode
**🔄 ENHANCEMENT OPPORTUNITIES:**
- Temporal workflow support for long-running conversations
- Specialized task handlers (code vs chat distinction)
- Advanced MCP tool selection patterns
- Additional research metric visualizations
The implementation is **production-ready** and demonstrates sophisticated quality control based on research findings, not just a proof-of-concept.