Hands-on implementation of a basic fraud detection agent system with step-by-step code walkthrough

In Part 1, we explored the theory behind agentic AI and why multi-agent systems matter in banking. Today, we build something real. By the end of this guide, you will have a working fraud detection system with three specialized agents collaborating to analyze suspicious transactions.
This is not a theoretical exercise. The code you write today follows the same patterns I use in production systems that process millions of transactions. The difference is scale, not architecture.
We will start with environment setup, build three agents with distinct responsibilities, give them tools to query data and calculate risk, orchestrate their collaboration, and run the complete workflow. You will see exactly how agents reason, make decisions, and produce actionable outputs. We will also compare performance across different LLM providers so you understand the tradeoffs between OpenAI, Anthropic, and open-source models.
If you have not read Part 1, I recommend starting there to understand the foundational concepts. This article assumes you know what agents, tasks, and crews are.
Let’s build.
Environment Setup and Configuration
Before writing any agent code, we need a clean Python environment with the right dependencies. I always start projects with a virtual environment to avoid package conflicts.
Creating the Project Structure
Open your terminal and create a new directory for this project.
mkdir fraud-detection-crew
cd fraud-detection-crew
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Installing Dependencies
We need CrewAI, LangChain integrations for multiple LLM providers, and supporting libraries.
pip install crewai==0.28.8 crewai-tools==0.1.6
pip install langchain-openai langchain-anthropic langchain-community
pip install python-dotenv pandas
Why these specific packages? CrewAI is the core framework. The langchain packages provide the LLM integrations. python-dotenv loads API keys from a .env file so they stay out of source control. Pandas helps with data manipulation for our fraud analysis.
Configuring API Keys
Create a .env file in your project directory to store API keys. Never hardcode keys in your source files.
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
If you are using open-source models like Llama through Ollama, you do not need API keys, but you need Ollama installed and running locally. We will cover that shortly.
Project File Structure
Create this structure to keep code organized:
fraud-detection-crew/
├── .env
├── main.py
├── agents.py
├── tasks.py
├── tools.py
└── data/
    └── sample_transactions.json
This separation makes the code maintainable. Agents go in one file, tasks in another, tools in a third. The main script orchestrates everything.
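The tools below return simulated records inline, but if you want data/sample_transactions.json to hold the same two transactions for later experiments, one possible shape (field names mirror the tool output; the file itself is an assumption, since nothing in this part reads it yet) is:

```json
[
  {
    "id": "TXN001",
    "amount": 4500.00,
    "merchant": "TechGadgets Online Store",
    "merchant_category": "electronics",
    "customer_id": "CUST12345",
    "timestamp": "2024-03-25T14:32:10Z",
    "location": "New York, USA",
    "payment_method": "credit_card",
    "card_last_four": "4782"
  },
  {
    "id": "TXN002",
    "amount": 150.00,
    "merchant": "Local Coffee Shop",
    "merchant_category": "food_beverage",
    "customer_id": "CUST12345",
    "timestamp": "2024-03-25T08:15:22Z",
    "location": "New York, USA",
    "payment_method": "debit_card",
    "card_last_four": "4782"
  }
]
```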
Configuring Multiple LLM Providers
One of CrewAI’s strengths is LLM flexibility. Let me show you how to configure OpenAI, Anthropic Claude, and Llama so you can switch between them easily.
Create a file called llm_config.py:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_community.llms import Ollama

load_dotenv()


def get_openai_llm(model="gpt-4", temperature=0.1):
    """
    Configure OpenAI models.
    Best for: Complex reasoning, following detailed instructions
    Cost: Higher ($0.03/1K tokens for GPT-4)
    """
    return ChatOpenAI(
        model=model,
        temperature=temperature,
        api_key=os.getenv("OPENAI_API_KEY")
    )


def get_anthropic_llm(model="claude-3-5-sonnet-20241022", temperature=0.1):
    """
    Configure Anthropic Claude models.
    Best for: Long context, nuanced analysis, compliance review
    Cost: Moderate ($0.003/1K tokens for Sonnet)
    """
    return ChatAnthropic(
        model=model,
        temperature=temperature,
        api_key=os.getenv("ANTHROPIC_API_KEY")
    )


def get_ollama_llm(model="llama3.1:8b", temperature=0.1):
    """
    Configure Ollama for local models.
    Best for: Privacy-sensitive workloads, high volume, no per-token cost
    Cost: Free (requires local compute)
    Note: Requires Ollama installed and running
    """
    return Ollama(
        model=model,
        temperature=temperature
    )


# Default LLM for quick testing
def get_default_llm():
    """Returns GPT-4 by default, but you can change this"""
    return get_openai_llm(model="gpt-4", temperature=0.1)
This configuration makes it trivial to swap models. Change one function call and your entire crew uses a different LLM. In production, I often use GPT-4 for the complex reasoning agent, Claude for compliance review, and Llama for data extraction where privacy matters.
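The swap-in-one-place idea can be sketched as a provider registry. The lambdas below are stand-in stubs so this runs without API keys; in the real project they would be the get_openai_llm, get_anthropic_llm, and get_ollama_llm factories from llm_config.py:

```python
# Sketch: a provider registry makes the LLM choice a single string.
# The lambdas are illustrative stubs, not real LLM clients.
LLM_PROVIDERS = {
    "openai": lambda: {"provider": "openai", "model": "gpt-4"},
    "anthropic": lambda: {"provider": "anthropic",
                          "model": "claude-3-5-sonnet-20241022"},
    "ollama": lambda: {"provider": "ollama", "model": "llama3.1:8b"},
}

def get_llm(provider: str):
    """Build the configured LLM for a provider name."""
    if provider not in LLM_PROVIDERS:
        raise ValueError(f"Unknown LLM provider: {provider}")
    return LLM_PROVIDERS[provider]()
```

With this pattern, per-agent model choices (GPT-4 for reasoning, Claude for compliance, Llama for extraction) become a matter of passing different provider strings.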
Building Custom Tools for Transaction Analysis
Agents need tools to interact with data and perform calculations. Let’s build three tools our fraud detection agents will use.
Create tools.py:
from crewai_tools import tool


@tool("Get Transaction Details")
def get_transaction_details(transaction_id: str) -> dict:
    """
    Retrieve detailed transaction information from the database.

    Args:
        transaction_id: The unique transaction identifier

    Returns:
        Dictionary containing transaction details
    """
    # In production, this would query your actual database.
    # For this example, we simulate database access with static data.
    transactions = {
        "TXN001": {
            "id": "TXN001",
            "amount": 4500.00,
            "merchant": "TechGadgets Online Store",
            "merchant_category": "electronics",
            "customer_id": "CUST12345",
            "timestamp": "2024-03-25T14:32:10Z",
            "location": "New York, USA",
            "payment_method": "credit_card",
            "card_last_four": "4782"
        },
        "TXN002": {
            "id": "TXN002",
            "amount": 150.00,
            "merchant": "Local Coffee Shop",
            "merchant_category": "food_beverage",
            "customer_id": "CUST12345",
            "timestamp": "2024-03-25T08:15:22Z",
            "location": "New York, USA",
            "payment_method": "debit_card",
            "card_last_four": "4782"
        }
    }
    return transactions.get(transaction_id, {"error": "Transaction not found"})


@tool("Get Customer Transaction History")
def get_customer_history(customer_id: str, days: int = 30) -> dict:
    """
    Retrieve customer transaction history for behavioral analysis.

    Args:
        customer_id: Customer identifier
        days: Number of days to look back (default 30)

    Returns:
        Dictionary with transaction statistics and patterns
    """
    # In production, this would query transaction history.
    # For this example, we simulate the analysis.
    return {
        "customer_id": customer_id,
        "analysis_period_days": days,
        "total_transactions": 47,
        "total_amount": 12350.00,
        "average_transaction": 262.77,
        "largest_transaction": 890.00,
        "most_common_category": "groceries",
        "typical_locations": ["New York, USA", "Brooklyn, USA"],
        "unusual_patterns": {
            "high_value_transactions": 2,
            "international_transactions": 0,
            "late_night_transactions": 3
        },
        "risk_indicators": {
            "velocity_normal": True,
            "amount_within_range": True,
            "location_consistent": True
        }
    }


@tool("Calculate Risk Score")
def calculate_risk_score(transaction_amount: float,
                         merchant_category: str,
                         customer_avg_transaction: float,
                         is_unusual_time: bool = False,
                         is_unusual_location: bool = False) -> dict:
    """
    Calculate fraud risk score based on multiple factors.

    Args:
        transaction_amount: Current transaction amount
        merchant_category: Category of merchant
        customer_avg_transaction: Customer's average transaction amount
        is_unusual_time: Whether transaction occurred at unusual time
        is_unusual_location: Whether location is unusual for customer

    Returns:
        Risk score (0-100) and risk classification
    """
    risk_score = 0
    risk_factors = []

    # Amount-based risk
    amount_ratio = (transaction_amount / customer_avg_transaction
                    if customer_avg_transaction > 0 else 1)
    if amount_ratio > 5:
        risk_score += 35
        risk_factors.append(f"Transaction {amount_ratio:.1f}x larger than average")
    elif amount_ratio > 3:
        risk_score += 20
        risk_factors.append(f"Transaction {amount_ratio:.1f}x larger than average")
    elif amount_ratio > 2:
        risk_score += 10
        risk_factors.append("Transaction moderately above average")

    # Category-based risk
    high_risk_categories = ["electronics", "jewelry", "wire_transfer", "crypto"]
    medium_risk_categories = ["travel", "online_gambling", "luxury_goods"]
    if merchant_category in high_risk_categories:
        risk_score += 25
        risk_factors.append(f"High-risk merchant category: {merchant_category}")
    elif merchant_category in medium_risk_categories:
        risk_score += 15
        risk_factors.append(f"Medium-risk merchant category: {merchant_category}")

    # Behavioral anomalies
    if is_unusual_time:
        risk_score += 15
        risk_factors.append("Transaction at unusual time")
    if is_unusual_location:
        risk_score += 20
        risk_factors.append("Transaction from unusual location")

    # Classify risk level
    if risk_score >= 70:
        risk_level = "HIGH"
        recommendation = "DECLINE"
    elif risk_score >= 40:
        risk_level = "MEDIUM"
        recommendation = "REVIEW"
    else:
        risk_level = "LOW"
        recommendation = "APPROVE"

    return {
        "risk_score": min(risk_score, 100),
        "risk_level": risk_level,
        "risk_factors": risk_factors,
        "recommendation": recommendation
    }
These tools demonstrate key patterns. The docstring is critical because agents read it to understand when and how to use the tool. The function signature defines what parameters the agent needs to provide. The return value gives the agent structured data to reason about.
Notice how the risk calculator implements actual business logic. In production, this would be more sophisticated, but the pattern is the same: encapsulate domain expertise in tools that agents can call.
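You can sanity-check the scoring rules by hand for TXN001: a $4,500 electronics purchase against a $262.77 customer average. This standalone snippet restates just the two rules that fire, outside the tool wrapper:

```python
# Manual walk-through of the scoring rules for TXN001.
amount, customer_avg = 4500.00, 262.77
score = 0

ratio = amount / customer_avg        # ≈ 17.1x the customer's average
if ratio > 5:
    score += 35                      # large amount anomaly

category = "electronics"
if category in ["electronics", "jewelry", "wire_transfer", "crypto"]:
    score += 25                      # high-risk merchant category

# 60 points falls in the 40-69 band: MEDIUM risk, recommendation REVIEW
level = "HIGH" if score >= 70 else "MEDIUM" if score >= 40 else "LOW"
print(score, level)                  # prints: 60 MEDIUM
```

This is the same 60/MEDIUM result you will see the risk analyst agent report when we run the crew later.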
Defining the Agent Team
Now we create three specialized agents. Each has a distinct role and access to specific tools.
Create agents.py:
from crewai import Agent
from llm_config import get_default_llm
from tools import (
    get_transaction_details,
    get_customer_history,
    calculate_risk_score
)


def create_fraud_detection_agents(llm=None):
    """
    Create a team of fraud detection agents.
    Returns a dictionary of agents for the crew.
    """
    if llm is None:
        llm = get_default_llm()

    # Agent 1: Transaction Investigator
    investigator = Agent(
        role="Senior Transaction Investigator",
        goal="Gather comprehensive transaction data and customer information",
        backstory="""You are a meticulous fraud analyst with 10 years of experience
        in banking operations. You excel at gathering all relevant data about transactions
        and customer behavior. You know which questions to ask and which data points matter
        for fraud detection. You always retrieve complete information before passing it to
        other team members.""",
        tools=[get_transaction_details, get_customer_history],
        llm=llm,
        verbose=True,
        allow_delegation=False
    )

    # Agent 2: Risk Analyst
    risk_analyst = Agent(
        role="Fraud Risk Assessment Specialist",
        goal="Analyze transaction patterns and calculate fraud risk scores",
        backstory="""You are a quantitative analyst specializing in fraud risk modeling.
        You understand statistical patterns, behavioral anomalies, and risk scoring
        methodologies. You take raw transaction data and customer history, identify
        suspicious patterns, and calculate precise risk scores. Your assessments are
        data-driven and follow established risk frameworks.""",
        tools=[calculate_risk_score],
        llm=llm,
        verbose=True,
        allow_delegation=False
    )

    # Agent 3: Decision Maker
    decision_maker = Agent(
        role="Fraud Decision Authority",
        goal="Make final fraud determinations and provide clear recommendations",
        backstory="""You are a senior fraud operations manager with authority to
        approve, decline, or flag transactions for review. You synthesize findings
        from investigators and analysts, apply business policies, consider customer
        impact, and make balanced decisions. Your recommendations are clear, justified,
        and actionable. You always explain your reasoning in terms that operations
        teams can act on immediately.""",
        tools=[],  # Decision maker uses analysis from other agents, not tools directly
        llm=llm,
        verbose=True,
        allow_delegation=False
    )

    return {
        "investigator": investigator,
        "risk_analyst": risk_analyst,
        "decision_maker": decision_maker
    }
Notice the design choices here. The investigator has data gathering tools. The risk analyst has calculation tools. The decision maker has no tools because it synthesizes results from the other agents. This mirrors how real fraud teams work.
The backstory for each agent provides context that influences decision-making. The investigator is meticulous about data completeness. The risk analyst is quantitative and methodical. The decision maker balances multiple concerns. These personality traits emerge in how agents reason.
Defining Tasks for the Workflow
Tasks specify what each agent should accomplish and how outputs flow between agents.
Create tasks.py:
from crewai import Task


def create_fraud_detection_tasks(agents, transaction_id):
    """
    Create a sequential workflow of tasks for fraud detection.
    Each task builds on the output of the previous task.
    """
    # Task 1: Investigate Transaction
    investigation_task = Task(
        description=f"""Investigate transaction {transaction_id} thoroughly.

        Your responsibilities:
        1. Retrieve complete transaction details using the transaction details tool
        2. Get customer transaction history using the customer history tool
        3. Identify any unusual patterns or anomalies in the data
        4. Summarize all findings in a structured format

        Provide a comprehensive report that includes:
        - Transaction details (amount, merchant, category, location, time)
        - Customer profile (typical behavior, transaction patterns, history)
        - Any red flags or unusual observations
        - Context needed for risk assessment""",
        expected_output="""A detailed investigation report containing:
        - Complete transaction information
        - Customer behavioral profile
        - Identified anomalies or unusual patterns
        - Relevant context for risk analysis""",
        agent=agents["investigator"]
    )

    # Task 2: Assess Risk
    risk_assessment_task = Task(
        description="""Analyze the investigation findings and calculate fraud risk.

        Your responsibilities:
        1. Review the investigation report from the previous task
        2. Identify specific risk factors based on transaction characteristics
        3. Use the risk score calculator tool with appropriate parameters
        4. Provide quantitative risk assessment with clear justification

        Your analysis should include:
        - Risk score calculation with methodology
        - Risk level classification (LOW/MEDIUM/HIGH)
        - Specific risk factors identified
        - Comparison to customer's normal behavior""",
        expected_output="""A comprehensive risk assessment containing:
        - Calculated risk score (0-100)
        - Risk level classification
        - List of specific risk factors
        - Statistical analysis of transaction vs baseline behavior
        - Confidence level in the assessment""",
        agent=agents["risk_analyst"],
        context=[investigation_task]  # This task depends on investigation results
    )

    # Task 3: Make Decision
    decision_task = Task(
        description="""Make final fraud determination and provide action recommendation.

        Your responsibilities:
        1. Review investigation findings and risk assessment
        2. Apply fraud policies and business rules
        3. Consider customer impact and false positive costs
        4. Make a clear recommendation: APPROVE, REVIEW, or DECLINE
        5. Justify your decision with specific reasoning

        Your decision should include:
        - Clear action recommendation (APPROVE/REVIEW/DECLINE)
        - Primary justification for the decision
        - Next steps for operations team
        - Any additional monitoring recommendations""",
        expected_output="""A final fraud decision report containing:
        - Clear recommendation: APPROVE, REVIEW, or DECLINE
        - Risk summary from previous analysis
        - Detailed justification for the decision
        - Specific next steps for operations
        - Any additional monitoring or follow-up actions needed""",
        agent=agents["decision_maker"],
        context=[investigation_task, risk_assessment_task]  # Uses both previous results
    )

    return [investigation_task, risk_assessment_task, decision_task]
The task definitions are explicit about what each agent should do and what output format is expected. The context parameter creates dependencies, so agents can access results from previous tasks. This is how information flows through the workflow.
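Conceptually, a sequential crew with context dependencies behaves like a small pipeline: each task runs with the outputs of the tasks it names. This toy version (plain functions, not the CrewAI API; names are illustrative) mimics that flow:

```python
# Each task spec is (name, function, list of dependency names).
# A task's function receives the outputs of its dependencies in order,
# just as `context` feeds earlier task results to a later agent.
def run_sequential(task_specs):
    outputs = {}
    for name, fn, deps in task_specs:
        context = [outputs[d] for d in deps]   # earlier tasks' results
        outputs[name] = fn(context)
    return outputs

results = run_sequential([
    ("investigate", lambda ctx: "investigation report", []),
    ("assess", lambda ctx: f"risk assessment using {ctx[0]}",
     ["investigate"]),
    ("decide", lambda ctx: f"decision from {len(ctx)} inputs",
     ["investigate", "assess"]),
])
print(results["decide"])   # prints: decision from 2 inputs
```

The decision step sees two inputs, exactly as decision_task lists both the investigation and the risk assessment in its context.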
Building and Running the Crew
Now we bring everything together in the main execution script.
Create main.py:
import os
from dotenv import load_dotenv
from crewai import Crew, Process
from agents import create_fraud_detection_agents
from tasks import create_fraud_detection_tasks
from llm_config import get_openai_llm, get_anthropic_llm, get_ollama_llm

load_dotenv()


def run_fraud_detection(transaction_id, llm_provider="openai"):
    """
    Run the fraud detection crew on a specific transaction.

    Args:
        transaction_id: Transaction to analyze
        llm_provider: "openai", "anthropic", or "ollama"
    """
    print(f"\n{'='*80}")
    print(f"FRAUD DETECTION ANALYSIS - Transaction: {transaction_id}")
    print(f"LLM Provider: {llm_provider.upper()}")
    print(f"{'='*80}\n")

    # Select LLM based on provider
    if llm_provider == "openai":
        llm = get_openai_llm(model="gpt-4")
    elif llm_provider == "anthropic":
        llm = get_anthropic_llm()
    elif llm_provider == "ollama":
        llm = get_ollama_llm(model="llama3.1:8b")
    else:
        raise ValueError(f"Unknown LLM provider: {llm_provider}")

    # Create agents with selected LLM
    agents = create_fraud_detection_agents(llm=llm)

    # Create tasks for this transaction
    tasks = create_fraud_detection_tasks(agents, transaction_id)

    # Assemble the crew
    crew = Crew(
        agents=list(agents.values()),
        tasks=tasks,
        process=Process.sequential,  # Tasks execute in order
        verbose=True
    )

    # Execute the workflow
    try:
        result = crew.kickoff()
        print(f"\n{'='*80}")
        print("FINAL DECISION")
        print(f"{'='*80}\n")
        print(result)
        return result
    except Exception as e:
        print(f"\nError during fraud detection: {str(e)}")
        return None


if __name__ == "__main__":
    # Example 1: Analyze high-value electronics purchase
    result1 = run_fraud_detection("TXN001", llm_provider="openai")

    # Example 2: Analyze normal coffee shop transaction
    # result2 = run_fraud_detection("TXN002", llm_provider="openai")

    # Try with different LLM providers
    # result3 = run_fraud_detection("TXN001", llm_provider="anthropic")
    # result4 = run_fraud_detection("TXN001", llm_provider="ollama")
This script orchestrates everything. It selects the LLM, creates agents, defines tasks, assembles the crew, and executes the workflow. The verbose flag shows you exactly what each agent is thinking and doing at each step.
Running the Complete Example
Now run the fraud detection system:
python main.py
You will see detailed output showing each agent’s reasoning process. The investigator retrieves transaction data and customer history. The risk analyst calculates scores and identifies risk factors. The decision maker synthesizes everything and provides a recommendation.
The output looks something like this:
================================================================================
FRAUD DETECTION ANALYSIS - Transaction: TXN001
LLM Provider: OPENAI
================================================================================
[Agent: Senior Transaction Investigator]
Starting investigation of transaction TXN001...
[Tool: Get Transaction Details]
Retrieved transaction data: $4,500 purchase at TechGadgets Online Store...
[Tool: Get Customer Transaction History]
Customer CUST12345 analysis: 47 transactions over 30 days, average $262.77...
Investigation findings:
- Transaction amount ($4,500) is 17x higher than customer average
- Merchant category: electronics (high-risk)
- Transaction time and location consistent with customer patterns
- No prior high-value electronics purchases in history
[Agent: Fraud Risk Assessment Specialist]
Analyzing risk factors from investigation...
[Tool: Calculate Risk Score]
Computing risk score with parameters:
- Transaction amount: $4,500
- Customer average: $262.77
- Merchant category: electronics
- Unusual patterns: None detected in time/location
Risk Assessment:
- Risk Score: 60/100
- Risk Level: MEDIUM
- Primary factors: Amount anomaly (17x average), high-risk category
[Agent: Fraud Decision Authority]
Reviewing all findings to make final determination...
DECISION: REVIEW
Justification: Transaction shows moderate risk due to amount anomaly but lacks
other fraud indicators. Customer has consistent location/time patterns.
Recommend manual review before approval.
Next Steps:
1. Contact customer to verify purchase intent
2. If confirmed, approve transaction
3. If unconfirmed, decline and issue fraud alert
This is a real agent workflow. The investigator gathered data, the analyst calculated risk, and the decision maker provided clear guidance.
Performance Comparison Across LLM Providers
Let me share actual performance observations from running this same fraud detection crew with different LLMs.
OpenAI GPT-4
Strengths:
- Excellent at following complex instructions
- Consistent reasoning quality
- Good at using tools correctly on first attempt
- Produces well-structured reports
Weaknesses:
- Higher cost ($0.03 per 1K input tokens)
- Slower response times (3–5 seconds per agent)
- Rate limits can be restrictive for high volume
Best for: Complex fraud scenarios requiring nuanced judgment
Anthropic Claude Sonnet
Strengths:
- Strong analytical reasoning
- Excellent at long-context processing
- More affordable than GPT-4 ($0.003 per 1K tokens)
- Very good at compliance-oriented tasks
Weaknesses:
- Occasionally over-explains reasoning
- Can be conservative in risk assessment
Best for: Scenarios requiring detailed analysis and audit trails
Llama 3.1 (via Ollama)
Strengths:
- Zero API costs after initial setup
- Fast response times (local execution)
- Complete data privacy (no external API calls)
- Good performance on structured tasks
Weaknesses:
- Requires local compute resources
- Sometimes needs more specific prompts
- May miss nuances in complex scenarios
- Quality depends on model size (8B vs 70B)
Best for: High-volume processing, privacy-sensitive workloads
Cost Analysis
For analyzing 1,000 transactions per day:
- GPT-4: ~$45–60/day (depending on conversation length)
- Claude Sonnet: ~$5–8/day
- Llama (local): $0 in API costs (electricity and compute amortized separately)
The choice depends on your priorities: quality, cost, privacy, or throughput.
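The GPT-4 figure can be checked with back-of-envelope arithmetic. The price is the one cited earlier; the tokens-per-transaction figure is an assumption for a three-agent run (prompts, tool output, and accumulated context):

```python
# Daily cost estimate for the GPT-4 configuration.
PRICE_PER_1K_TOKENS = 0.03     # USD, GPT-4 input pricing cited above
tokens_per_transaction = 1700  # assumed total for a three-agent workflow
transactions_per_day = 1000

daily_cost = (transactions_per_day * tokens_per_transaction / 1000
              * PRICE_PER_1K_TOKENS)
print(f"${daily_cost:.2f}/day")   # prints: $51.00/day
```

At roughly 1,700 tokens per transaction, the estimate lands at $51/day, inside the ~$45–60 range; longer agent conversations push it toward the top of that range.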
Output Handling and Error Management
Production systems need robust error handling. Here is how to improve the main script:
# Add this to main.py so it can reuse run_fraud_detection and the os import.
def run_fraud_detection_safe(transaction_id, llm_provider="openai", max_retries=3):
    """
    Run fraud detection with error handling and retry logic.
    """
    for attempt in range(max_retries):
        try:
            result = run_fraud_detection(transaction_id, llm_provider)
            if result is None:
                raise ValueError("Crew returned no result")

            # Validate result format
            if not isinstance(result, str) or len(result) < 50:
                raise ValueError("Invalid result format")

            # Save result to file
            output_file = f"results/fraud_analysis_{transaction_id}.txt"
            os.makedirs("results", exist_ok=True)
            with open(output_file, 'w') as f:
                f.write(result)
            print(f"\nResult saved to: {output_file}")

            return result
        except Exception as e:
            print(f"\nAttempt {attempt + 1} failed: {str(e)}")
            if attempt < max_retries - 1:
                print("Retrying...")
                continue
            else:
                print("Max retries exceeded. Analysis failed.")
                return None
This adds retry logic, result validation, and persistent storage. In production, you would also add logging, metrics collection, and alerting.
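One way to layer in the logging and backoff, sketched as a reusable helper. The name with_retries and the backoff parameters are illustrative, not part of CrewAI:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fraud-crew")

def with_retries(fn, max_retries=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt == max_retries:
                raise                      # surface the final failure
            time.sleep(base_delay * 2 ** (attempt - 1))
```

You would then wrap the crew call as `with_retries(lambda: run_fraud_detection("TXN001"))`, keeping the retry policy in one place instead of inlined in every caller.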
Key Takeaways from This Implementation
What did we build? A three-agent fraud detection system that:
- Gathers transaction and customer data autonomously
- Calculates risk scores using domain-specific logic
- Makes decisions with clear justification
- Runs on multiple LLM providers with one configuration change
- Follows a sequential workflow where each agent builds on previous results
This is the foundation pattern for agentic AI in banking. The same architecture scales to more complex scenarios by adding agents, tools, and workflow steps.
In Part 3, we will build on this foundation with hierarchical workflows, manager agents coordinating multiple specialists, and more sophisticated use cases like credit assessment and customer service automation. We will also cover advanced topics like context optimization, parallel execution, and agent memory.
What You Should Do Next
Clone this code and experiment. Change the transaction amounts and see how risk scores adjust. Modify agent backstories and observe how decision-making changes. Swap LLM providers and compare outputs. Add new tools for merchant verification or geographic risk assessment.
The best way to understand agentic AI is to build with it. This implementation gives you a working foundation. Extend it to match your specific fraud detection requirements.
If you found this guide valuable, give it a clap and leave a comment about your experiments. What fraud patterns are you trying to detect? Which LLM provider worked best for your use case? What challenges did you encounter?
Follow me for Part 3 where we tackle more complex multi-agent workflows with hierarchical coordination and advanced banking use cases.
About This Series: This is Part 2 of a 4-part series on building multi-agent AI systems for banking with CrewAI.
- Part 1: Foundation and concepts of agentic AI
- Part 2 (this article): Basic implementation with fraud detection
- Part 3 (coming soon): Intermediate workflows for customer service and credit assessment
- Part 4 (coming soon): Production-ready transaction reconciliation system
Follow me to get notified when the next parts are published. Share this with colleagues exploring AI in banking or financial services.
Building Multi-Agent AI Systems for Banking: Simple Task Automation with CrewAI (Part 2) was originally published in Towards AI on Medium.