Hands-on implementation of a basic fraud detection agent system with step-by-step code walkthrough

In Part 1, we explored the theory behind agentic AI and why multi-agent systems matter in banking. Today, we build something real. By the end of this guide, you will have a working fraud detection system with three specialized agents collaborating to analyze suspicious transactions.
This is not a theoretical exercise. The code you write today follows the same patterns I use in production systems that process millions of transactions. The difference is scale, not architecture.
We will start with environment setup, build three agents with distinct responsibilities, give them tools to query data and calculate risk, orchestrate their collaboration, and run the complete workflow. You will see exactly how agents reason, make decisions, and produce actionable outputs. We will also compare performance across different LLM providers so you understand the tradeoffs between OpenAI, Anthropic, and open-source models.
If you have not read Part 1, I recommend starting there to understand the foundational concepts. This article assumes you know what agents, tasks, and crews are.
Let’s build.
Environment Setup and Configuration
Before writing any agent code, we need a clean Python environment with the right dependencies. I always start projects with a virtual environment to avoid package conflicts.
Creating the Project Structure
Open your terminal and create a new directory for this project.
mkdir fraud-detection-crew
cd fraud-detection-crew
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Installing Dependencies
We need CrewAI, LangChain integrations for multiple LLM providers, and supporting libraries.
pip install crewai==0.28.8 crewai-tools==0.1.6
pip install langchain-openai langchain-anthropic langchain-community
pip install python-dotenv pandas
Why these specific packages? CrewAI is the core framework. The langchain packages provide the LLM integrations. python-dotenv loads API keys from a .env file so they stay out of source control. Pandas helps with data manipulation for our fraud analysis.
Configuring API Keys
Create a .env file in your project directory to store API keys. Never hardcode keys in your source files.
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
If you are using open-source models like Llama through Ollama, you do not need API keys, but you need Ollama installed and running locally. We will cover that shortly.
Project File Structure
Create this structure to keep code organized:
fraud-detection-crew/
├── .env
├── main.py
├── agents.py
├── tasks.py
├── tools.py
└── data/
    └── sample_transactions.json
This separation makes the code maintainable. Agents go in one file, tasks in another, tools in a third. The main script orchestrates everything.
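The tools below return simulated records inline, but if you want data/sample_transactions.json to hold the same two transactions for later experiments, one possible shape (field names mirror the tool output; the file itself is an assumption, since nothing in this part reads it yet) is:

```json
[
  {
    "id": "TXN001",
    "amount": 4500.00,
    "merchant": "TechGadgets Online Store",
    "merchant_category": "electronics",
    "customer_id": "CUST12345",
    "timestamp": "2024-03-25T14:32:10Z",
    "location": "New York, USA",
    "payment_method": "credit_card",
    "card_last_four": "4782"
  },
  {
    "id": "TXN002",
    "amount": 150.00,
    "merchant": "Local Coffee Shop",
    "merchant_category": "food_beverage",
    "customer_id": "CUST12345",
    "timestamp": "2024-03-25T08:15:22Z",
    "location": "New York, USA",
    "payment_method": "debit_card",
    "card_last_four": "4782"
  }
]
```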
Configuring Multiple LLM Providers
One of CrewAI’s strengths is LLM flexibility. Let me show you how to configure OpenAI, Anthropic Claude, and Llama so you can switch between them easily.
Create a file called llm_config.py:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_community.llms import Ollama

load_dotenv()


def get_openai_llm(model="gpt-4", temperature=0.1):
    """
    Configure OpenAI models.
    Best for: Complex reasoning, following detailed instructions
    Cost: Higher ($0.03/1K tokens for GPT-4)
    """
    return ChatOpenAI(
        model=model,
        temperature=temperature,
        api_key=os.getenv("OPENAI_API_KEY")
    )


def get_anthropic_llm(model="claude-3-5-sonnet-20241022", temperature=0.1):
    """
    Configure Anthropic Claude models.
    Best for: Long context, nuanced analysis, compliance review
    Cost: Moderate ($0.003/1K tokens for Sonnet)
    """
    return ChatAnthropic(
        model=model,
        temperature=temperature,
        api_key=os.getenv("ANTHROPIC_API_KEY")
    )


def get_ollama_llm(model="llama3.1:8b", temperature=0.1):
    """
    Configure Ollama for local models.
    Best for: Privacy-sensitive workloads, high volume, no per-token cost
    Cost: Free (requires local compute)
    Note: Requires Ollama installed and running
    """
    return Ollama(
        model=model,
        temperature=temperature
    )


# Default LLM for quick testing
def get_default_llm():
    """Returns GPT-4 by default, but you can change this"""
    return get_openai_llm(model="gpt-4", temperature=0.1)
This configuration makes it trivial to swap models. Change one function call and your entire crew uses a different LLM. In production, I often use GPT-4 for the complex reasoning agent, Claude for compliance review, and Llama for data extraction where privacy matters.
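The swap-in-one-place idea can be sketched as a provider registry. The lambdas below are stand-in stubs so this runs without API keys; in the real project they would be the get_openai_llm, get_anthropic_llm, and get_ollama_llm factories from llm_config.py:

```python
# Sketch: a provider registry makes the LLM choice a single string.
# The lambdas are illustrative stubs, not real LLM clients.
LLM_PROVIDERS = {
    "openai": lambda: {"provider": "openai", "model": "gpt-4"},
    "anthropic": lambda: {"provider": "anthropic",
                          "model": "claude-3-5-sonnet-20241022"},
    "ollama": lambda: {"provider": "ollama", "model": "llama3.1:8b"},
}

def get_llm(provider: str):
    """Build the configured LLM for a provider name."""
    if provider not in LLM_PROVIDERS:
        raise ValueError(f"Unknown LLM provider: {provider}")
    return LLM_PROVIDERS[provider]()
```

With this pattern, per-agent model choices (GPT-4 for reasoning, Claude for compliance, Llama for extraction) become a matter of passing different provider strings.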
Building Custom Tools for Transaction Analysis
Agents need tools to interact with data and perform calculations. Let’s build three tools our fraud detection agents will use.
Create tools.py:
from crewai_tools import tool


@tool("Get Transaction Details")
def get_transaction_details(transaction_id: str) -> dict:
    """
    Retrieve detailed transaction information from the database.

    Args:
        transaction_id: The unique transaction identifier

    Returns:
        Dictionary containing transaction details
    """
    # In production, this would query your actual database.
    # For this example, we simulate database access with static data.
    transactions = {
        "TXN001": {
            "id": "TXN001",
            "amount": 4500.00,
            "merchant": "TechGadgets Online Store",
            "merchant_category": "electronics",
            "customer_id": "CUST12345",
            "timestamp": "2024-03-25T14:32:10Z",
            "location": "New York, USA",
            "payment_method": "credit_card",
            "card_last_four": "4782"
        },
        "TXN002": {
            "id": "TXN002",
            "amount": 150.00,
            "merchant": "Local Coffee Shop",
            "merchant_category": "food_beverage",
            "customer_id": "CUST12345",
            "timestamp": "2024-03-25T08:15:22Z",
            "location": "New York, USA",
            "payment_method": "debit_card",
            "card_last_four": "4782"
        }
    }
    return transactions.get(transaction_id, {"error": "Transaction not found"})


@tool("Get Customer Transaction History")
def get_customer_history(customer_id: str, days: int = 30) -> dict:
    """
    Retrieve customer transaction history for behavioral analysis.

    Args:
        customer_id: Customer identifier
        days: Number of days to look back (default 30)

    Returns:
        Dictionary with transaction statistics and patterns
    """
    # In production, this would query transaction history.
    # For this example, we simulate the analysis.
    return {
        "customer_id": customer_id,
        "analysis_period_days": days,
        "total_transactions": 47,
        "total_amount": 12350.00,
        "average_transaction": 262.77,
        "largest_transaction": 890.00,
        "most_common_category": "groceries",
        "typical_locations": ["New York, USA", "Brooklyn, USA"],
        "unusual_patterns": {
            "high_value_transactions": 2,
            "international_transactions": 0,
            "late_night_transactions": 3
        },
        "risk_indicators": {
            "velocity_normal": True,
            "amount_within_range": True,
            "location_consistent": True
        }
    }


@tool("Calculate Risk Score")
def calculate_risk_score(transaction_amount: float,
                         merchant_category: str,
                         customer_avg_transaction: float,
                         is_unusual_time: bool = False,
                         is_unusual_location: bool = False) -> dict:
    """
    Calculate fraud risk score based on multiple factors.

    Args:
        transaction_amount: Current transaction amount
        merchant_category: Category of merchant
        customer_avg_transaction: Customer's average transaction amount
        is_unusual_time: Whether transaction occurred at unusual time
        is_unusual_location: Whether location is unusual for customer

    Returns:
        Risk score (0-100) and risk classification
    """
    risk_score = 0
    risk_factors = []

    # Amount-based risk
    amount_ratio = (transaction_amount / customer_avg_transaction
                    if customer_avg_transaction > 0 else 1)
    if amount_ratio > 5:
        risk_score += 35
        risk_factors.append(f"Transaction {amount_ratio:.1f}x larger than average")
    elif amount_ratio > 3:
        risk_score += 20
        risk_factors.append(f"Transaction {amount_ratio:.1f}x larger than average")
    elif amount_ratio > 2:
        risk_score += 10
        risk_factors.append("Transaction moderately above average")

    # Category-based risk
    high_risk_categories = ["electronics", "jewelry", "wire_transfer", "crypto"]
    medium_risk_categories = ["travel", "online_gambling", "luxury_goods"]
    if merchant_category in high_risk_categories:
        risk_score += 25
        risk_factors.append(f"High-risk merchant category: {merchant_category}")
    elif merchant_category in medium_risk_categories:
        risk_score += 15
        risk_factors.append(f"Medium-risk merchant category: {merchant_category}")

    # Behavioral anomalies
    if is_unusual_time:
        risk_score += 15
        risk_factors.append("Transaction at unusual time")
    if is_unusual_location:
        risk_score += 20
        risk_factors.append("Transaction from unusual location")

    # Classify risk level
    if risk_score >= 70:
        risk_level = "HIGH"
        recommendation = "DECLINE"
    elif risk_score >= 40:
        risk_level = "MEDIUM"
        recommendation = "REVIEW"
    else:
        risk_level = "LOW"
        recommendation = "APPROVE"

    return {
        "risk_score": min(risk_score, 100),
        "risk_level": risk_level,
        "risk_factors": risk_factors,
        "recommendation": recommendation
    }
These tools demonstrate key patterns. The docstring is critical because agents read it to understand when and how to use the tool. The function signature defines what parameters the agent needs to provide. The return value gives the agent structured data to reason about.
Notice how the risk calculator implements actual business logic. In production, this would be more sophisticated, but the pattern is the same: encapsulate domain expertise in tools that agents can call.
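You can sanity-check the scoring rules by hand for TXN001: a $4,500 electronics purchase against a $262.77 customer average. This standalone snippet restates just the two rules that fire, outside the tool wrapper:

```python
# Manual walk-through of the scoring rules for TXN001.
amount, customer_avg = 4500.00, 262.77
score = 0

ratio = amount / customer_avg        # ≈ 17.1x the customer's average
if ratio > 5:
    score += 35                      # large amount anomaly

category = "electronics"
if category in ["electronics", "jewelry", "wire_transfer", "crypto"]:
    score += 25                      # high-risk merchant category

# 60 points falls in the 40-69 band: MEDIUM risk, recommendation REVIEW
level = "HIGH" if score >= 70 else "MEDIUM" if score >= 40 else "LOW"
print(score, level)                  # prints: 60 MEDIUM
```

This is the same 60/MEDIUM result you will see the risk analyst agent report when we run the crew later.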
Defining the Agent Team
Now we create three specialized agents. Each has a distinct role and access to specific tools.
Create agents.py:
from crewai import Agent
from llm_config import get_default_llm
from tools import (
    get_transaction_details,
    get_customer_history,
    calculate_risk_score
)


def create_fraud_detection_agents(llm=None):
    """
    Create a team of fraud detection agents.
    Returns a dictionary of agents for the crew.
    """
    if llm is None:
        llm = get_default_llm()

    # Agent 1: Transaction Investigator
    investigator = Agent(
        role="Senior Transaction Investigator",
        goal="Gather comprehensive transaction data and customer information",
        backstory="""You are a meticulous fraud analyst with 10 years of experience
        in banking operations. You excel at gathering all relevant data about transactions
        and customer behavior. You know which questions to ask and which data points matter
        for fraud detection. You always retrieve complete information before passing it to
        other team members.""",
        tools=[get_transaction_details, get_customer_history],
        llm=llm,
        verbose=True,
        allow_delegation=False
    )

    # Agent 2: Risk Analyst
    risk_analyst = Agent(
        role="Fraud Risk Assessment Specialist",
        goal="Analyze transaction patterns and calculate fraud risk scores",
        backstory="""You are a quantitative analyst specializing in fraud risk modeling.
        You understand statistical patterns, behavioral anomalies, and risk scoring
        methodologies. You take raw transaction data and customer history, identify
        suspicious patterns, and calculate precise risk scores. Your assessments are
        data-driven and follow established risk frameworks.""",
        tools=[calculate_risk_score],
        llm=llm,
        verbose=True,
        allow_delegation=False
    )

    # Agent 3: Decision Maker
    decision_maker = Agent(
        role="Fraud Decision Authority",
        goal="Make final fraud determinations and provide clear recommendations",
        backstory="""You are a senior fraud operations manager with authority to
        approve, decline, or flag transactions for review. You synthesize findings
        from investigators and analysts, apply business policies, consider customer
        impact, and make balanced decisions. Your recommendations are clear, justified,
        and actionable. You always explain your reasoning in terms that operations
        teams can act on immediately.""",
        tools=[],  # Decision maker uses analysis from other agents, not tools directly
        llm=llm,
        verbose=True,
        allow_delegation=False
    )

    return {
        "investigator": investigator,
        "risk_analyst": risk_analyst,
        "decision_maker": decision_maker
    }
Notice the design choices here. The investigator has data gathering tools. The risk analyst has calculation tools. The decision maker has no tools because it synthesizes results from the other agents. This mirrors how real fraud teams work.
The backstory for each agent provides context that influences decision-making. The investigator is meticulous about data completeness. The risk analyst is quantitative and methodical. The decision maker balances multiple concerns. These personality traits emerge in how agents reason.
Defining Tasks for the Workflow
Tasks specify what each agent should accomplish and how outputs flow between agents.
Create tasks.py:
from crewai import Task


def create_fraud_detection_tasks(agents, transaction_id):
    """
    Create a sequential workflow of tasks for fraud detection.
    Each task builds on the output of the previous task.
    """
    # Task 1: Investigate Transaction
    investigation_task = Task(
        description=f"""Investigate transaction {transaction_id} thoroughly.

        Your responsibilities:
        1. Retrieve complete transaction details using the transaction details tool
        2. Get customer transaction history using the customer history tool
        3. Identify any unusual patterns or anomalies in the data
        4. Summarize all findings in a structured format

        Provide a comprehensive report that includes:
        - Transaction details (amount, merchant, category, location, time)
        - Customer profile (typical behavior, transaction patterns, history)
        - Any red flags or unusual observations
        - Context needed for risk assessment""",
        expected_output="""A detailed investigation report containing:
        - Complete transaction information
        - Customer behavioral profile
        - Identified anomalies or unusual patterns
        - Relevant context for risk analysis""",
        agent=agents["investigator"]
    )

    # Task 2: Assess Risk
    risk_assessment_task = Task(
        description="""Analyze the investigation findings and calculate fraud risk.

        Your responsibilities:
        1. Review the investigation report from the previous task
        2. Identify specific risk factors based on transaction characteristics
        3. Use the risk score calculator tool with appropriate parameters
        4. Provide quantitative risk assessment with clear justification

        Your analysis should include:
        - Risk score calculation with methodology
        - Risk level classification (LOW/MEDIUM/HIGH)
        - Specific risk factors identified
        - Comparison to customer's normal behavior""",
        expected_output="""A comprehensive risk assessment containing:
        - Calculated risk score (0-100)
        - Risk level classification
        - List of specific risk factors
        - Statistical analysis of transaction vs baseline behavior
        - Confidence level in the assessment""",
        agent=agents["risk_analyst"],
        context=[investigation_task]  # This task depends on investigation results
    )

    # Task 3: Make Decision
    decision_task = Task(
        description="""Make final fraud determination and provide action recommendation.

        Your responsibilities:
        1. Review investigation findings and risk assessment
        2. Apply fraud policies and business rules
        3. Consider customer impact and false positive costs
        4. Make a clear recommendation: APPROVE, REVIEW, or DECLINE
        5. Justify your decision with specific reasoning

        Your decision should include:
        - Clear action recommendation (APPROVE/REVIEW/DECLINE)
        - Primary justification for the decision
        - Next steps for operations team
        - Any additional monitoring recommendations""",
        expected_output="""A final fraud decision report containing:
        - Clear recommendation: APPROVE, REVIEW, or DECLINE
        - Risk summary from previous analysis
        - Detailed justification for the decision
        - Specific next steps for operations
        - Any additional monitoring or follow-up actions needed""",
        agent=agents["decision_maker"],
        context=[investigation_task, risk_assessment_task]  # Uses both previous results
    )

    return [investigation_task, risk_assessment_task, decision_task]
The task definitions are explicit about what each agent should do and what output format is expected. The context parameter creates dependencies, so agents can access results from previous tasks. This is how information flows through the workflow.
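Conceptually, a sequential crew with context dependencies behaves like a small pipeline: each task runs with the outputs of the tasks it names. This toy version (plain functions, not the CrewAI API; names are illustrative) mimics that flow:

```python
# Each task spec is (name, function, list of dependency names).
# A task's function receives the outputs of its dependencies in order,
# just as `context` feeds earlier task results to a later agent.
def run_sequential(task_specs):
    outputs = {}
    for name, fn, deps in task_specs:
        context = [outputs[d] for d in deps]   # earlier tasks' results
        outputs[name] = fn(context)
    return outputs

results = run_sequential([
    ("investigate", lambda ctx: "investigation report", []),
    ("assess", lambda ctx: f"risk assessment using {ctx[0]}",
     ["investigate"]),
    ("decide", lambda ctx: f"decision from {len(ctx)} inputs",
     ["investigate", "assess"]),
])
print(results["decide"])   # prints: decision from 2 inputs
```

The decision step sees two inputs, exactly as decision_task lists both the investigation and the risk assessment in its context.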
Building and Running the Crew
Now we bring everything together in the main execution script.
Create main.py:
import os
from dotenv import load_dotenv
from crewai import Crew, Process
from agents import create_fraud_detection_agents
from tasks import create_fraud_detection_tasks
from llm_config import get_openai_llm, get_anthropic_llm, get_ollama_llm

load_dotenv()


def run_fraud_detection(transaction_id, llm_provider="openai"):
    """
    Run the fraud detection crew on a specific transaction.

    Args:
        transaction_id: Transaction to analyze
        llm_provider: "openai", "anthropic", or "ollama"
    """
    print(f"\n{'='*80}")
    print(f"FRAUD DETECTION ANALYSIS - Transaction: {transaction_id}")
    print(f"LLM Provider: {llm_provider.upper()}")
    print(f"{'='*80}\n")

    # Select LLM based on provider
    if llm_provider == "openai":
        llm = get_openai_llm(model="gpt-4")
    elif llm_provider == "anthropic":
        llm = get_anthropic_llm()
    elif llm_provider == "ollama":
        llm = get_ollama_llm(model="llama3.1:8b")
    else:
        raise ValueError(f"Unknown LLM provider: {llm_provider}")

    # Create agents with selected LLM
    agents = create_fraud_detection_agents(llm=llm)

    # Create tasks for this transaction
    tasks = create_fraud_detection_tasks(agents, transaction_id)

    # Assemble the crew
    crew = Crew(
        agents=list(agents.values()),
        tasks=tasks,
        process=Process.sequential,  # Tasks execute in order
        verbose=True
    )

    # Execute the workflow
    try:
        result = crew.kickoff()
        print(f"\n{'='*80}")
        print("FINAL DECISION")
        print(f"{'='*80}\n")
        print(result)
        return result
    except Exception as e:
        print(f"\nError during fraud detection: {str(e)}")
        return None


if __name__ == "__main__":
    # Example 1: Analyze high-value electronics purchase
    result1 = run_fraud_detection("TXN001", llm_provider="openai")

    # Example 2: Analyze normal coffee shop transaction
    # result2 = run_fraud_detection("TXN002", llm_provider="openai")

    # Try with different LLM providers
    # result3 = run_fraud_detection("TXN001", llm_provider="anthropic")
    # result4 = run_fraud_detection("TXN001", llm_provider="ollama")
This script orchestrates everything. It selects the LLM, creates agents, defines tasks, assembles the crew, and executes the workflow. The verbose flag shows you exactly what each agent is thinking and doing at each step.
Running the Complete Example
Now run the fraud detection system:
python main.py
You will see detailed output showing each agent’s reasoning process. The investigator retrieves transaction data and customer history. The risk analyst calculates scores and identifies risk factors. The decision maker synthesizes everything and provides a recommendation.
The output looks something like this:
================================================================================
FRAUD DETECTION ANALYSIS - Transaction: TXN001
LLM Provider: OPENAI
================================================================================
[Agent: Senior Transaction Investigator]
Starting investigation of transaction TXN001...
[Tool: Get Transaction Details]
Retrieved transaction data: $4,500 purchase at TechGadgets Online Store...
[Tool: Get Customer Transaction History]
Customer CUST12345 analysis: 47 transactions over 30 days, average $262.77...
Investigation findings:
- Transaction amount ($4,500) is 17x higher than customer average
- Merchant category: electronics (high-risk)
- Transaction time and location consistent with customer patterns
- No prior high-value electronics purchases in history
[Agent: Fraud Risk Assessment Specialist]
Analyzing risk factors from investigation...
[Tool: Calculate Risk Score]
Computing risk score with parameters:
- Transaction amount: $4,500
- Customer average: $262.77
- Merchant category: electronics
- Unusual patterns: None detected in time/location
Risk Assessment:
- Risk Score: 60/100
- Risk Level: MEDIUM
- Primary factors: Amount anomaly (17x average), high-risk category
[Agent: Fraud Decision Authority]
Reviewing all findings to make final determination...
DECISION: REVIEW
Justification: Transaction shows moderate risk due to amount anomaly but lacks
other fraud indicators. Customer has consistent location/time patterns.
Recommend manual review before approval.
Next Steps:
1. Contact customer to verify purchase intent
2. If confirmed, approve transaction
3. If unconfirmed, decline and issue fraud alert
This is a real agent workflow. The investigator gathered data, the analyst calculated risk, and the decision maker provided clear guidance.
Performance Comparison Across LLM Providers
Let me share actual performance observations from running this same fraud detection crew with different LLMs.
OpenAI GPT-4
Strengths:
- Excellent at following complex instructions
- Consistent reasoning quality
- Good at using tools correctly on first attempt
- Produces well-structured reports
Weaknesses:
- Higher cost ($0.03 per 1K input tokens)
- Slower response times (3–5 seconds per agent)
- Rate limits can be restrictive for high volume
Best for: Complex fraud scenarios requiring nuanced judgment
Anthropic Claude Sonnet
Strengths:
- Strong analytical reasoning
- Excellent at long-context processing
- More affordable than GPT-4 ($0.003 per 1K tokens)
- Very good at compliance-oriented tasks
Weaknesses:
- Occasionally over-explains reasoning
- Can be conservative in risk assessment
Best for: Scenarios requiring detailed analysis and audit trails
Llama 3.1 (via Ollama)
Strengths:
- Zero API costs after initial setup
- Fast response times (local execution)
- Complete data privacy (no external API calls)
- Good performance on structured tasks
Weaknesses:
- Requires local compute resources
- Sometimes needs more specific prompts
- May miss nuances in complex scenarios
- Quality depends on model size (8B vs 70B)
Best for: High-volume processing, privacy-sensitive workloads
Cost Analysis
For analyzing 1,000 transactions per day:
- GPT-4: ~$45–60/day (depending on conversation length)
- Claude Sonnet: ~$5–8/day
- Llama (local): $0 in API costs (electricity and compute amortized separately)
The choice depends on your priorities: quality, cost, privacy, or throughput.
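The GPT-4 figure can be checked with back-of-envelope arithmetic. The price is the one cited earlier; the tokens-per-transaction figure is an assumption for a three-agent run (prompts, tool output, and accumulated context):

```python
# Daily cost estimate for the GPT-4 configuration.
PRICE_PER_1K_TOKENS = 0.03     # USD, GPT-4 input pricing cited above
tokens_per_transaction = 1700  # assumed total for a three-agent workflow
transactions_per_day = 1000

daily_cost = (transactions_per_day * tokens_per_transaction / 1000
              * PRICE_PER_1K_TOKENS)
print(f"${daily_cost:.2f}/day")   # prints: $51.00/day
```

At roughly 1,700 tokens per transaction, the estimate lands at $51/day, inside the ~$45–60 range; longer agent conversations push it toward the top of that range.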
Output Handling and Error Management
Production systems need robust error handling. Here is how to improve the main script:
# Add this to main.py so it can reuse run_fraud_detection and the os import.
def run_fraud_detection_safe(transaction_id, llm_provider="openai", max_retries=3):
    """
    Run fraud detection with error handling and retry logic.
    """
    for attempt in range(max_retries):
        try:
            result = run_fraud_detection(transaction_id, llm_provider)
            if result is None:
                raise ValueError("Crew returned no result")

            # Validate result format
            if not isinstance(result, str) or len(result) < 50:
                raise ValueError("Invalid result format")

            # Save result to file
            output_file = f"results/fraud_analysis_{transaction_id}.txt"
            os.makedirs("results", exist_ok=True)
            with open(output_file, 'w') as f:
                f.write(result)
            print(f"\nResult saved to: {output_file}")

            return result
        except Exception as e:
            print(f"\nAttempt {attempt + 1} failed: {str(e)}")
            if attempt < max_retries - 1:
                print("Retrying...")
                continue
            else:
                print("Max retries exceeded. Analysis failed.")
                return None
This adds retry logic, result validation, and persistent storage. In production, you would also add logging, metrics collection, and alerting.
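One way to layer in the logging and backoff, sketched as a reusable helper. The name with_retries and the backoff parameters are illustrative, not part of CrewAI:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fraud-crew")

def with_retries(fn, max_retries=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt == max_retries:
                raise                      # surface the final failure
            time.sleep(base_delay * 2 ** (attempt - 1))
```

You would then wrap the crew call as `with_retries(lambda: run_fraud_detection("TXN001"))`, keeping the retry policy in one place instead of inlined in every caller.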
Key Takeaways from This Implementation
What did we build? A three-agent fraud detection system that:
- Gathers transaction and customer data autonomously
- Calculates risk scores using domain-specific logic
- Makes decisions with clear justification
- Runs on multiple LLM providers with one configuration change
- Follows a sequential workflow where each agent builds on previous results
This is the foundation pattern for agentic AI in banking. The same architecture scales to more complex scenarios by adding agents, tools, and workflow steps.
In Part 3, we will build on this foundation with hierarchical workflows, manager agents coordinating multiple specialists, and more sophisticated use cases like credit assessment and customer service automation. We will also cover advanced topics like context optimization, parallel execution, and agent memory.
What You Should Do Next
Clone this code and experiment. Change the transaction amounts and see how risk scores adjust. Modify agent backstories and observe how decision-making changes. Swap LLM providers and compare outputs. Add new tools for merchant verification or geographic risk assessment.
The best way to understand agentic AI is to build with it. This implementation gives you a working foundation. Extend it to match your specific fraud detection requirements.
If you found this guide valuable, give it a clap and leave a comment about your experiments. What fraud patterns are you trying to detect? Which LLM provider worked best for your use case? What challenges did you encounter?
Follow me for Part 3 where we tackle more complex multi-agent workflows with hierarchical coordination and advanced banking use cases.
About This Series: This is Part 2 of a 4-part series on building multi-agent AI systems for banking with CrewAI.
- Part 1: Foundation and concepts of agentic AI
- Part 2 (this article): Basic implementation with fraud detection
- Part 3 (coming soon): Intermediate workflows for customer service and credit assessment
- Part 4 (coming soon): Production-ready transaction reconciliation system
Follow me to get notified when the next parts are published. Share this with colleagues exploring AI in banking or financial services.
Building Multi-Agent AI Systems for Banking: Simple Task Automation with CrewAI (Part 2) was originally published in Towards AI on Medium.