Building AI Agents Part 1: Defining Purpose, Designing Prompts, and Selecting Models

The critical first steps that determine whether your AI agent succeeds or fails in production — with real examples from banking, retail, and healthcare

A healthcare startup spent six months building an AI agent for patient triage. They used the latest GPT-4 model. They hired experienced ML engineers. They built a beautiful interface. The demo impressed investors.

Then they launched to real clinics.

Within days, nurses stopped using it. The agent asked irrelevant questions. It missed critical symptoms. It provided inconsistent advice. Sometimes it was too cautious, sending patients with minor issues to emergency rooms. Other times it was too aggressive, dismissing serious conditions.

The problem was not the model. The problem was not the code. The problem was the foundation. They skipped the critical first steps. They never properly defined the purpose. They rushed through prompt design. They chose the wrong model configuration.

Three months and significant rework later, they got it right. Today their agent handles over 10,000 patient interactions daily with 94% nurse satisfaction and measurably better patient outcomes.

This article covers the three foundation steps that determine whether your AI agent succeeds or fails: defining purpose, designing prompts, and selecting models. Get these right, and everything else becomes easier. Get them wrong, and no amount of engineering will save you.

This is part of the series Building Production AI Agents: A Complete Architecture Guide, where we walk through an 8-step framework to take agents from concept to deployment, with practical patterns and examples across banking, healthcare, retail, manufacturing, and beyond.

Step 1: Define Purpose and Scope

Before writing a single line of code, you must answer four questions with precision: What problem are you solving? Who are your users? What does success look like? What are your constraints?

Vague answers lead to failed projects. Specific answers lead to production systems.

Identifying the Right Use Case

Not every problem needs an AI agent. Rule-based systems work better for simple workflows. Traditional ML models work better for pure prediction tasks. Human experts work better for high-stakes decisions requiring accountability.

AI agents excel when you need autonomous decision-making across complex, multi-step workflows with access to multiple information sources and tools.

Consider a banking fraud detection scenario. A simple rule catches obvious fraud patterns like duplicate transactions or geographic impossibilities. But sophisticated fraud requires investigation. The agent must analyze transaction history, compare against customer behavior patterns, check merchant reputation databases, calculate risk scores using multiple models, correlate with known fraud networks, and generate evidence packages for human review.

This requires autonomy (operating 24/7 across thousands of transactions), tool use (accessing multiple databases and APIs), reasoning (connecting disparate signals), and workflow orchestration (multi-step investigation process). Perfect for an AI agent.

Contrast this with a simple customer service query like checking account balance. That needs a basic API call, not an autonomous agent. Use the simplest solution that works.

Understanding User Needs

Your users determine your architecture. A field technician needs mobile access with offline capability. A data analyst needs dashboards with drill-down investigation. An automated system needs clean APIs with predictable response formats.

In retail inventory management, store managers need simple recommendations they can accept or override. “Reorder 150 units of Product X by Friday” is actionable. “Inventory optimization score: 0.847” is useless. The agent must speak the user’s language and fit their workflow.

For agricultural monitoring agents serving smallholder farmers in rural areas, consider connectivity constraints. The agent might need SMS alerts instead of a web dashboard. It must work with intermittent internet. It must provide actionable advice in local languages. “Check field 3 for pest damage this afternoon” beats “Anomaly detected in sector 3 with 0.73 confidence.”

Healthcare clinicians need different interfaces than patients. A doctor reviewing agent recommendations needs detailed clinical reasoning, confidence scores, and references to medical literature. A patient needs simplified explanations and clear next steps. Same underlying agent, different interfaces.

Setting Success Criteria

Vague goals like “improve efficiency” guarantee failure. Measurable targets drive decisions.

For a manufacturing quality control agent, success might mean: detect 95% of defects that human inspectors catch, reduce false positive rate below 10%, process inspection data within 2 seconds per item, decrease overall defect escape rate by 30%, maintain 99.9% uptime during production hours.

These numbers shape every architectural decision. The 2-second latency requirement rules out certain models. The 95% detection rate requires extensive training data. The 99.9% uptime demands redundancy and error handling.
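One way to keep targets like these honest is to record them as data and check measured values against them. The sketch below is only an illustration: the `Criterion` class, the metric names, and the `evaluate` helper are hypothetical, and the thresholds are copied from the manufacturing example. The relative target (30% fewer defect escapes) is left out because it needs a baseline measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    target: float
    higher_is_better: bool

    def met(self, measured: float) -> bool:
        # "Higher is better" metrics must reach the target;
        # "lower is better" metrics must stay at or under it.
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


# Targets taken from the manufacturing quality control example.
QC_CRITERIA = [
    Criterion("defect_detection_rate", 0.95, True),
    Criterion("false_positive_rate", 0.10, False),
    Criterion("latency_seconds_per_item", 2.0, False),
    Criterion("uptime", 0.999, True),
]


def evaluate(measured: dict[str, float]) -> dict[str, bool]:
    """Pass/fail per criterion; a missing measurement counts as a failure."""
    return {
        c.name: (c.name in measured and c.met(measured[c.name]))
        for c in QC_CRITERIA
    }


report = evaluate({
    "defect_detection_rate": 0.96,
    "false_positive_rate": 0.12,
    "latency_seconds_per_item": 1.4,
})
```

In this run the false positive rate misses its target and uptime fails because nobody measured it, which is often the more useful finding.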

For an aviation maintenance agent, success criteria include different dimensions: predict component failures 30 days in advance with 80% accuracy, reduce unscheduled maintenance events by 40%, ensure 100% regulatory compliance in maintenance recommendations, never recommend deferring safety-critical maintenance, generate complete audit trails for all decisions.

The 100% compliance requirement demands rigorous testing and human oversight. The safety-critical constraint requires conservative decision thresholds. These are architectural requirements, not post-deployment optimizations.

Defining Constraints

Every production system faces constraints. Technical constraints like latency requirements, infrastructure limitations, and integration requirements. Business constraints like budget, timeline, and organizational capabilities. Regulatory constraints like data privacy, industry compliance, and audit requirements.

In banking, regulatory constraints dominate. A fraud detection agent must maintain complete audit trails. It must explain every decision. It must never access customer data without authorization. It must comply with specific regulations that vary by jurisdiction. These constraints are not negotiable.

For healthcare agents, HIPAA compliance is mandatory. Patient data must be encrypted at rest and in transit. Access must be logged and auditable. Data retention must follow specific rules. The agent cannot store certain information even temporarily. These constraints directly impact memory architecture and deployment options.

Retail agents face different constraints. Peak load during holiday shopping requires horizontal scalability. Integration with legacy point-of-sale systems limits technology choices. Cost sensitivity drives model selection toward cheaper options. These constraints shape the entire technical stack.

Document your constraints explicitly. They determine what is possible.

Multi-Industry Use Case Examples

Banking: Transaction Monitoring Agent

Purpose: Detect and investigate suspicious transactions in real-time across millions of daily transactions.
Users: Fraud analysts and compliance officers.
Success: Catch 90% of fraud with false positive rate under 5%, investigate each case within 30 seconds, maintain complete audit trail.
Constraints: Regulatory compliance, explain every decision, never block legitimate transactions without analyst review, operate 24/7 with 99.95% uptime.

Retail: Dynamic Pricing Agent

Purpose: Optimize pricing across thousands of SKUs based on demand, competition, inventory levels, and margin requirements.
Users: Category managers and pricing analysts.
Success: Increase revenue by 8% while maintaining target margins, update prices twice daily, handle seasonal variations, respect price floors and ceilings.
Constraints: Integrate with existing ERP system, never price below cost, maintain competitive positioning, comply with pricing regulations.

Healthcare: Patient Triage Agent

Purpose: Assess patient symptoms, determine urgency level, recommend appropriate care setting, schedule appointments.
Users: Nurses, patients, clinic administrators.
Success: 95% agreement with nurse assessment, reduce wait times by 30%, improve patient satisfaction scores, route urgent cases immediately.
Constraints: HIPAA compliance, never delay emergency care, maintain empathetic communication, integrate with EHR systems, support multiple languages.

Manufacturing: Predictive Maintenance Agent

Purpose: Monitor equipment health, predict failures, schedule preventive maintenance, optimize maintenance crew allocation.
Users: Maintenance supervisors, plant managers, technicians.
Success: Predict failures 30 days ahead with 85% accuracy, reduce unplanned downtime by 50%, optimize maintenance costs, extend equipment life by 20%.
Constraints: Real-time sensor data processing, integrate with existing CMMS, work with legacy equipment, ensure safety compliance, support offline operation.

Agriculture: Crop Health Monitoring Agent

Purpose: Monitor crop health using satellite imagery and ground sensors, detect disease outbreaks, recommend interventions, optimize irrigation and fertilization.
Users: Farm managers, agronomists, cooperative managers.
Success: Detect disease 7 days earlier than manual inspection, reduce water usage by 25%, increase yield by 15%, provide recommendations in local language.
Constraints: Work with intermittent connectivity, support low-end mobile devices, integrate with existing farm management software, provide SMS alerts, operate across diverse crop types.

Each use case has unique requirements that shape the entire architecture. Spend time getting this step right.
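The four questions from Step 1 can also live next to the code as a small, checkable record. The following is a minimal sketch, not a prescribed format: `AgentSpec` and `missing_sections` are hypothetical names, and the values are abbreviated from the banking example above.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    name: str
    purpose: str
    users: list[str]
    success_criteria: list[str]
    constraints: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of the Step 1 questions that are still unanswered."""
        answers = {
            "purpose": self.purpose.strip(),
            "users": self.users,
            "success_criteria": self.success_criteria,
            "constraints": self.constraints,
        }
        return [section for section, value in answers.items() if not value]


banking = AgentSpec(
    name="Transaction Monitoring Agent",
    purpose="Detect and investigate suspicious transactions in real time.",
    users=["Fraud analysts", "Compliance officers"],
    success_criteria=[
        "Catch 90% of fraud",
        "False positive rate under 5%",
        "Investigate each case within 30 seconds",
    ],
    constraints=[
        "Explain every decision",
        "No blocking without analyst review",
        "99.95% uptime",
    ],
)

# A draft that has not answered most of the questions yet.
draft = AgentSpec(name="Pricing Agent", purpose=" ", users=[], success_criteria=["Increase revenue by 8%"])
```

Calling `draft.missing_sections()` lists the gaps (purpose, users, constraints), which makes an incomplete definition visible before any prompt or model work starts.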

Step 2: System Prompt Design

The system prompt is your agent’s operating manual. It defines personality, capabilities, limitations, and behavior patterns. A well-designed prompt ensures consistency, safety, and reliability. A poor prompt leads to unpredictable behavior that fails in production.

Crafting Clear Goals and Objectives

Start with what the agent should accomplish. Be specific. Be complete.

For a banking fraud analyst agent, goals might include: “Analyze transactions for fraud indicators. Calculate risk scores using provided scoring models. Cross-reference with known fraud patterns. Generate detailed investigation reports. Escalate high-risk cases immediately. Maintain professional tone in all communications. Never disclose investigation details to customers. Always provide reasoning for risk assessments.”

These goals establish scope and boundaries. The agent knows what to do and what to avoid.

For a retail customer service agent: “Resolve customer inquiries about orders, products, and policies. Access order history and inventory systems. Process returns within policy guidelines. Escalate refund requests over $500 to human agents. Maintain friendly, helpful tone. Never make promises outside company policies. Always offer alternatives when unable to fulfill requests. Collect feedback after each interaction.”

Notice the explicit boundaries. The agent cannot approve large refunds. It must escalate. It must stay within policies. These constraints prevent problematic behavior.

Defining Role and Persona

Persona affects consistency and user trust. A healthcare triage agent needs a calm, professional persona that inspires confidence. An educational tutoring agent needs an encouraging, patient persona that motivates learning. A technical support agent needs a knowledgeable, methodical persona that instills trust in solutions.

For an aviation maintenance agent: “You are an experienced aircraft maintenance engineer with 20 years of experience across multiple aircraft types. You prioritize safety above all else. You follow manufacturer guidelines strictly. You communicate clearly with technical precision. You never rush decisions when safety is involved. You document everything thoroughly for regulatory compliance.”

This persona shapes how the agent approaches problems, communicates findings, and makes decisions. It provides consistency across thousands of interactions.

For an agricultural extension agent serving smallholder farmers: “You are a knowledgeable agricultural advisor with deep understanding of local crops, climate, and farming practices. You communicate in simple, practical terms. You understand resource constraints farmers face. You provide actionable advice that works with available tools and budget. You respect traditional knowledge while introducing modern techniques. You are patient and encouraging.”

The persona must match the audience. Technical jargon works for expert users. Plain language works for general audiences. Cultural sensitivity matters for global deployments.

Writing Clear Instructions

Instructions tell the agent how to perform tasks. Vague instructions lead to inconsistent behavior. Specific instructions ensure reliability.

For a manufacturing quality control agent:

“When analyzing product images:

Load the image and extract features using the vision model.
Compare against quality specifications for this product type.
Identify any defects with bounding boxes and classification.
Calculate defect severity scores using the provided rubric.
If severity exceeds threshold, mark product as reject and log details.
If severity is borderline, request human review with supporting evidence.
Update quality metrics in the database.
If multiple defects of the same type appear in succession, alert supervisors immediately.”

These step-by-step instructions reduce ambiguity. The agent knows exactly what to do in each scenario.
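The branching in those instructions can be mirrored in ordinary code, which is useful for testing what the agent decides. This is a sketch under stated assumptions: the 0 to 1 severity scale, both threshold values, and the run length of three are hypothetical placeholders for whatever the real rubric specifies, and the input list is assumed to hold one defect type per consecutive defective item.

```python
from enum import Enum


class Disposition(Enum):
    PASS = "pass"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"


# Hypothetical thresholds on a 0-1 severity scale.
REJECT_THRESHOLD = 0.7
REVIEW_THRESHOLD = 0.5


def disposition(severity: float) -> Disposition:
    """Map a severity score to reject, human review, or pass."""
    if severity > REJECT_THRESHOLD:
        return Disposition.REJECT
    if severity >= REVIEW_THRESHOLD:
        # Borderline band: a person decides, with the evidence attached.
        return Disposition.HUMAN_REVIEW
    return Disposition.PASS


def should_alert_supervisor(recent_defect_types: list[str], run_length: int = 3) -> bool:
    """True when the last `run_length` defects all share the same type."""
    if len(recent_defect_types) < run_length:
        return False
    tail = recent_defect_types[-run_length:]
    return len(set(tail)) == 1
```

Keeping the thresholds as named constants means a rubric change touches one place rather than a paragraph of prompt text.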

For a healthcare patient triage agent:

“During patient assessment:

Collect symptoms, duration, and severity using the structured questionnaire.
Ask follow-up questions based on symptom clusters.
Check for red flag symptoms that require immediate care.
If red flags present, classify as emergency and route to immediate care pathway.
Calculate triage score using the approved algorithm.
Recommend appropriate care setting based on score and symptoms.
Schedule appointment if applicable.
Provide patient with clear next steps and timeline.
Never diagnose conditions or recommend specific treatments.
Always err on the side of caution for ambiguous cases.”

The “never diagnose” instruction is critical. It establishes clear boundaries that prevent liability issues.
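The ordering in these instructions matters: red flags are checked before any score is consulted, and ambiguity pushes the result upward. A minimal sketch of that ordering follows. The red flag set, the 0 to 10 score scale, and the score bands are placeholders invented for illustration, not clinical values; a real system would take them from the approved algorithm.

```python
# Placeholder values for illustration only.
RED_FLAGS = frozenset({"chest pain", "difficulty breathing", "severe bleeding"})

# Care settings ordered from lowest to highest acuity.
LEVELS = ["self_care", "primary_care", "urgent_care", "emergency"]


def care_setting(symptoms: set[str], triage_score: int, ambiguous: bool = False) -> str:
    """Pick a care setting; red flags override the score entirely."""
    if symptoms & RED_FLAGS:
        return "emergency"

    # Hypothetical score bands on a 0-10 scale.
    if triage_score >= 7:
        level = 2
    elif triage_score >= 3:
        level = 1
    else:
        level = 0

    if ambiguous:
        # Err on the side of caution: move up one level, capped at the top.
        level = min(level + 1, len(LEVELS) - 1)
    return LEVELS[level]
```

Note that the function only routes. It never names a condition or a treatment, which matches the boundary set in the instructions.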

Implementing Guardrails

Guardrails prevent harmful or problematic behavior. They are mandatory for production systems.

Content guardrails prevent inappropriate outputs. A customer service agent should never use profanity, even if the customer does. A financial advisory agent should never guarantee returns or make promises about market performance.

Safety guardrails prevent dangerous actions. An aviation agent should never recommend deferring safety-critical maintenance. A healthcare agent should never suggest delaying emergency care. A manufacturing agent should never disable safety interlocks.

Privacy guardrails prevent data leaks. A banking agent should never share customer information across accounts. A healthcare agent should never disclose patient information without authorization. A retail agent should never expose business-sensitive pricing algorithms.

For a banking fraud detection agent:

“Guardrails:

Never reveal fraud detection methods or models to any user.
Never disclose investigation details to account holders under investigation.
Never access accounts without legitimate fraud investigation purpose.
Never make final decisions on account closures without analyst approval.
Never share customer information across different investigations.
Always maintain audit logs of all account accesses.
If uncertain about regulatory compliance, escalate to human supervisor.”

These guardrails protect the bank, customers, and the integrity of fraud investigations.

For a healthcare triage agent:

“Guardrails:

Never provide specific diagnoses or treatment recommendations.
Always escalate chest pain, difficulty breathing, severe bleeding, or other emergency symptoms immediately.
Never recommend delaying care for serious symptoms to save costs.
Never access patient records without active patient interaction.
Always maintain empathetic, professional communication.
If patient expresses suicidal thoughts, immediately connect to crisis resources.
Never share patient information with unauthorized parties.
If uncertain about urgency level, err toward higher acuity.”

These guardrails protect patient safety and ensure appropriate care.
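Prompt-level guardrails can be complemented by a check in code that inspects a draft response before it is sent. The sketch below shows the idea with two deliberately crude patterns. The pattern names and regular expressions are hypothetical, and simple keyword matching like this will produce false positives, so treat it as a starting point rather than a complete filter.

```python
import re

# Hypothetical patterns for illustration; real deployments maintain these per domain.
FORBIDDEN_PATTERNS = {
    "guaranteed_returns": re.compile(r"\bguarantee[ds]?\b.*\breturns?\b", re.IGNORECASE),
    "diagnosis": re.compile(r"\byou (have|are suffering from)\b", re.IGNORECASE),
}


def guardrail_violations(draft: str) -> list[str]:
    """Names of the guardrails that a draft response appears to violate."""
    return [
        name
        for name, pattern in FORBIDDEN_PATTERNS.items()
        if pattern.search(draft)
    ]


def safe_to_send(draft: str) -> bool:
    """A draft with any violation goes to a fallback or human review instead."""
    return not guardrail_violations(draft)
```

A response that trips a check would typically be replaced by a safe fallback message or escalated, and the violation logged for the next round of prompt refinement.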

Prompt Iteration and Testing

Your first prompt will not be perfect. Test it with real scenarios. Find edge cases. Refine iteratively.

Start with a basic prompt. Test it with 20–30 representative scenarios. Document failures. Analyze patterns. Update the prompt. Repeat.

For a retail pricing agent, you might discover it recommends prices below cost during high-demand periods. Add an explicit constraint: “Never price products below cost plus 10% minimum margin.”

You might find a healthcare triage agent dismisses recurring headaches as minor when they could indicate serious conditions. Update instructions: “For recurring symptoms lasting more than two weeks, always recommend medical evaluation regardless of symptom severity.”

Testing reveals gaps in instructions, missing guardrails, and ambiguous guidance. This iterative refinement is not optional. It is how you build reliable production systems.
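The test-and-refine loop described here is easy to automate once scenarios are written down as input and check pairs. The following is a minimal sketch. `run_scenarios` is a hypothetical helper, and `stub_agent` is a stand-in for a real call to the model with the system prompt under test.

```python
from typing import Callable

# A scenario pairs a user input with a check on the agent's output.
Scenario = tuple[str, Callable[[str], bool]]


def run_scenarios(agent: Callable[[str], str], scenarios: list[Scenario]) -> list[str]:
    """Return the inputs whose output failed its check."""
    failures = []
    for user_input, check in scenarios:
        if not check(agent(user_input)):
            failures.append(user_input)
    return failures


def stub_agent(user_input: str) -> str:
    # Stand-in for the real agent; replace with an actual model call.
    if "chest pain" in user_input.lower():
        return "EMERGENCY: call emergency services now."
    return "Rest and monitor your symptoms."


scenarios: list[Scenario] = [
    ("I have chest pain and feel dizzy", lambda out: out.startswith("EMERGENCY")),
    ("Headaches every day for three weeks", lambda out: "medical evaluation" in out.lower()),
]

failures = run_scenarios(stub_agent, scenarios)
```

Here the recurring headache scenario fails, which is the same gap the article's example uncovers. The failing inputs become the evidence for the next prompt revision, and the scenario list grows into a regression suite.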

Multi-Industry Prompt Examples

Banking Fraud Analyst Agent

“You are a fraud detection specialist with 15 years of experience in financial crime investigation. Your role is to analyze transactions for fraud indicators and provide detailed risk assessments to fraud analysts.

Goals:

Analyze transaction patterns for anomalies and fraud indicators
Calculate risk scores using approved models and algorithms
Cross-reference transactions against known fraud databases
Generate comprehensive investigation reports with evidence
Escalate high-risk cases immediately to human analysts

Instructions:

When analyzing a transaction:

Retrieve complete transaction history for the account
Check for velocity anomalies (unusual frequency or amounts)
Verify merchant and location consistency with customer profile
Calculate fraud probability using the ensemble model
If score exceeds 0.8, escalate immediately with detailed evidence
If score between 0.5–0.8, flag for analyst review with supporting data
If score below 0.5, clear transaction and log decision
Always document reasoning and data sources used

Guardrails:

Never reveal fraud detection methods to any user
Never discuss ongoing investigations with account holders
Never access accounts without legitimate investigation reason
Always maintain audit trail of all decisions
Escalate to human supervisor when uncertain about regulatory compliance
Never approve account closures without analyst confirmation”
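The score thresholds in the banking prompt above map directly onto routing code, which gives a deterministic backstop for the agent's own judgment. A small sketch, assuming scores lie in the range 0 to 1; `route_transaction` and the returned labels are hypothetical names. The prompt says "exceeds 0.8", so a score of exactly 0.8 falls into the analyst review band here.

```python
def route_transaction(score: float) -> str:
    """Route a fraud probability to escalate, analyst review, or clear."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score > 0.8:
        return "escalate"
    if score >= 0.5:
        return "analyst_review"
    return "clear"
```

Making the boundary cases explicit in code is a cheap way to notice when a prompt leaves them open to interpretation.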

Retail Customer Service Agent

“You are a helpful, friendly customer service representative for a premium retail brand. Your role is to assist customers with orders, products, and policies while maintaining excellent customer experience.

Goals:

Resolve customer inquiries quickly and effectively
Process returns and exchanges within policy guidelines
Provide accurate product information and recommendations
Maintain brand voice and customer satisfaction
Escalate complex issues appropriately

Instructions:

For order inquiries:

Retrieve order details using order number or email
Provide current status and expected delivery date
If delayed, explain reason and offer alternatives
Process tracking updates and address concerns
For missing orders, escalate to fulfillment team

For returns:

Verify purchase within return window (30 days)
Confirm product condition meets return criteria
Process return authorization and provide label
Initiate refund once item received
Escalate refunds over $500 to manager approval

Guardrails:

Never approve refunds outside return policy without manager approval
Never make promises about delivery dates you cannot control
Never share business-sensitive information about suppliers or costs
Always maintain professional, empathetic tone even with frustrated customers
If customer becomes abusive, politely disengage and escalate
Never process returns without proper verification”

Healthcare Patient Triage Agent

“You are a compassionate, experienced triage nurse helping patients determine appropriate care settings. Your role is to assess symptoms, determine urgency, and guide patients to the right care level while ensuring safety.

Goals:

Accurately assess patient symptoms and urgency level
Route patients to appropriate care settings (emergency, urgent care, primary care, self-care)
Ensure immediate attention for emergency conditions
Reduce unnecessary emergency department visits
Maintain patient trust and satisfaction

Instructions:

During assessment:

Collect chief complaint and symptom details using structured questions
Ask about symptom duration, severity, and progression
Check for red flag symptoms requiring immediate care
Screen for vital sign abnormalities if available
Calculate triage acuity using approved algorithm
Provide care recommendation with clear reasoning
Schedule appointment if appropriate
Give clear instructions for next steps
Provide return precautions (when to seek higher care level)

Red flag symptoms requiring emergency care:

Chest pain or pressure
Difficulty breathing or shortness of breath
Severe bleeding or trauma
Sudden severe headache or vision changes
Confusion or altered mental status
Signs of stroke (facial drooping, arm weakness, speech difficulty)
Severe allergic reaction
Suicidal or homicidal thoughts

Guardrails:

Never provide specific diagnoses or treatment recommendations
Always classify red flag symptoms as emergency regardless of other factors
Never recommend delaying emergency care due to cost or convenience
Always maintain empathetic, patient-centered communication
If patient mentions suicidal thoughts, immediately provide crisis resources
Never access records without active patient consent
When uncertain about urgency, always err toward higher acuity level
Escalate complex cases to nurse supervisor”

Notice how each prompt is tailored to the specific domain, user type, and risk profile. The banking agent focuses on investigation and compliance. The retail agent balances customer satisfaction with policy adherence. The healthcare agent prioritizes safety above all else.
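All three prompts share the same skeleton: a role, then goals, instructions, and guardrails. Assembling that skeleton from parts keeps prompts consistent and makes each section easy to revise on its own. The function below is one possible sketch; `build_system_prompt` and its layout choices (bulleted goals and guardrails, numbered instructions) are assumptions, not a required format.

```python
def build_system_prompt(
    role: str,
    goals: list[str],
    instructions: list[str],
    guardrails: list[str],
) -> str:
    """Assemble a system prompt from the four sections used in this article."""

    def section(title: str, items: list[str], numbered: bool = False) -> str:
        lines = [
            f"{i}. {item}" if numbered else f"- {item}"
            for i, item in enumerate(items, start=1)
        ]
        return "\n".join([f"{title}:"] + lines)

    return "\n\n".join([
        role.strip(),
        section("Goals", goals),
        section("Instructions", instructions, numbered=True),
        section("Guardrails", guardrails),
    ])


prompt = build_system_prompt(
    role="You are a fraud detection specialist.",
    goals=["Analyze transaction patterns for anomalies"],
    instructions=["Retrieve transaction history", "Calculate fraud probability"],
    guardrails=["Never reveal fraud detection methods to any user"],
)
```

With the sections stored as lists, adding a guardrail discovered during testing is a one-line change that can be reviewed and versioned like any other code.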

Step 3: Choose the Right LLM

Model selection directly impacts performance, cost, latency, and capabilities. The wrong model choice undermines everything else.

Understanding Base Model Options

The landscape changes rapidly, but core trade-offs remain constant: capability versus cost, latency versus quality, hosted versus self-hosted, general-purpose versus specialized.

GPT-4 and GPT-4 Turbo (OpenAI)

Strengths: Excellent reasoning, strong instruction following, good creative content generation, reliable function calling, extensive tool use capabilities.
Weaknesses: Higher cost, moderate latency, dependency on the OpenAI API.
Best for: Complex reasoning tasks, creative content, general-purpose agents, applications where quality matters more than cost.

Claude 3.5 and Claude 4.5 (Anthropic)

Strengths: Superior long document analysis, very strong reasoning, excellent code generation, nuanced instruction following, 200K token context window, strong safety alignment.
Weaknesses: Higher cost, API-only access, rate limits on free tier.
Best for: Document analysis, research agents, coding assistants, applications requiring long context, use cases where safety and alignment matter.

Llama 3 70B and Llama 3.1 405B (Meta)

Strengths: Openly available weights, self-hostable, no per-token API fees, good performance, strong community support, multiple fine-tuned variants available.
Weaknesses: Requires hosting infrastructure and carries its own inference costs, less capable than frontier models, needs technical expertise for deployment.
Best for: On-premise deployments, data privacy requirements, high-volume applications where API costs prohibit other options, customization needs.

Gemini Pro and Ultra (Google)

Strengths: Multimodal capabilities, strong reasoning, good integration with Google services, competitive pricing.
Weaknesses: Less proven for production agent systems, smaller community ecosystem.
Best for: Applications requiring vision, integration with Google Cloud, multimodal agents.

Domain-Specific Models

Specialized models exist for healthcare (Med-PaLM), code (CodeLlama, StarCoder), finance, and other domains. These offer better performance for narrow use cases but lack general capabilities.

Parameter Tuning for Production

Temperature controls randomness. Low temperature (0.0–0.3) produces consistent, near-deterministic outputs. High temperature (0.7–1.0) produces creative, varied outputs.

For a banking fraud agent, use temperature 0.0. Consistency matters more than creativity. The same transaction should produce the same risk assessment every time. Even at temperature 0.0, hosted models can show small run-to-run variation, so log every output rather than assuming perfect reproducibility.

For a customer service agent, use temperature 0.3–0.5. Allow some variation in phrasing to feel natural while maintaining consistent policy interpretation.

For creative content agents generating marketing copy or product descriptions, use temperature 0.7–0.9. Creativity and variety add value.

Top-p (nucleus sampling) provides another randomness control. Lower values (0.1–0.5) restrict to most likely tokens. Higher values (0.7–0.95) allow more diversity.

Max tokens controls output length. Set it based on your use case. A fraud investigation report needs 1000–2000 tokens. A customer service response needs 200–500 tokens. A simple API call response needs 50–100 tokens.

Tune parameters based on production data, not demo scenarios. Test with real user inputs. Measure quality metrics. Iterate.
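The starting points above can be kept in one place as named presets, so that tuning later means editing data rather than hunting through call sites. This is a sketch only: the field names mirror common API parameters but are not tied to any specific SDK, the preset keys are hypothetical, and the `max_tokens` value for marketing copy is an assumption since the article gives no figure for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    max_tokens: int


# Starting points taken from the ranges discussed above.
PRESETS = {
    "fraud_analysis": SamplingConfig(temperature=0.0, max_tokens=2000),
    "customer_service": SamplingConfig(temperature=0.4, max_tokens=500),
    # max_tokens here is an assumed value, not from the article.
    "marketing_copy": SamplingConfig(temperature=0.8, max_tokens=500),
}


def sampling_for(task: str) -> SamplingConfig:
    """Look up a preset, falling back to the most conservative one."""
    return PRESETS.get(task, PRESETS["fraud_analysis"])
```

Falling back to the lowest-temperature preset for unknown tasks is a design choice that favors consistency when a new task type has not been tuned yet.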

Context Window Considerations

Context window determines how much information your agent can consider simultaneously. Longer context windows enable more sophisticated reasoning but increase cost and latency.

GPT-4 Turbo offers 128K tokens. Claude 3.5 offers 200K tokens. Llama 3 70B offers 8K tokens (Llama 3.1 models extend this to 128K). Gemini 1.5 Pro offers up to 1M tokens.

For a healthcare patient triage agent, 8K tokens suffices. Each interaction involves a few hundred tokens of conversation plus access to a few pages of clinical guidelines.

For a legal contract analysis agent, 200K tokens enables analyzing entire contracts with supporting case law and regulations in a single context window. This dramatically improves quality compared to chunking documents.

For a customer service agent with access to product catalogs, order history, and policy documents, 32K–64K provides a good balance between capability and cost.

Consider your memory architecture (covered in Part 2) when selecting context window requirements. Vector databases can provide relevant information retrieval, reducing the need for massive context windows.
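A rough budget check helps decide whether a given window is enough before committing to a model. The sketch below uses the common rule of thumb of about four characters per token for English text, which is only an approximation; for real counts, use the tokenizer that matches the model. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def fits_in_context(parts: list[str], context_window: int, reserved_for_output: int) -> bool:
    """Check whether prompt parts plus reserved output space fit the window."""
    used = sum(estimate_tokens(part) for part in parts)
    return used + reserved_for_output <= context_window


# Example: clinical guidelines plus a conversation, checked against an 8K window.
guidelines = "g" * 20_000    # stands in for a few pages of guideline text
conversation = "c" * 8_000   # stands in for the dialogue so far
ok = fits_in_context([guidelines, conversation], context_window=8_000, reserved_for_output=500)
```

When the check fails, the usual options are retrieving only the relevant passages, summarizing older conversation turns, or moving to a model with a longer window.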

Balancing Cost, Latency, and Accuracy

Every production system faces trade-offs between these three factors.

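Cost

GPT-4 costs roughly $30 per million input tokens and $60 per million output tokens. Claude 3.5 Sonnet is considerably cheaper at list price, roughly $3 per million input tokens and $15 per million output tokens. Llama 3 70B has infrastructure costs but no per-token API fees. Prices change often, so check current rates before budgeting.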

For a high-volume retail customer service agent handling 100,000 interactions daily with an average of 500 tokens per interaction, API costs could reach $1,500 daily or $45,000 monthly at GPT-4 input rates. This might justify self-hosting Llama or using smaller models.

For a low-volume aviation maintenance agent processing 1,000 interactions monthly with complex analysis, API costs might total $500 monthly. Paying for GPT-4 quality is reasonable.
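The arithmetic behind these estimates is worth scripting so it can be rerun as volumes and prices change. The sketch below reproduces the retail figure under the assumption that all 500 tokens are billed at the $30 per million input rate. `daily_api_cost` is a hypothetical helper, and the 400/100 input/output split in the second call is an invented example.

```python
def daily_api_cost(
    interactions_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Daily API spend in dollars for a fixed per-interaction token profile."""
    cost_per_interaction_millionths = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return interactions_per_day * cost_per_interaction_millionths / 1_000_000


# Retail example: 100,000 interactions, 500 tokens each, all at the input rate.
retail_daily = daily_api_cost(100_000, 500, 0, 30.0, 60.0)
retail_monthly = retail_daily * 30

# Same volume with an assumed split of 400 input and 100 output tokens.
retail_daily_split = daily_api_cost(100_000, 400, 100, 30.0, 60.0)
```

The first call works out to $1,500 per day and $45,000 over a 30-day month. Billing part of the traffic at the output rate raises the daily figure to $1,800, so the input/output split is worth measuring rather than guessing.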

Latency

First-token latency ranges from 200ms to 2 seconds depending on model size and hosting. Total response time includes model inference plus tool calls plus processing.

For real-time customer chat, latency above 3 seconds feels slow. Use faster models or streaming responses.

For batch processing fraud investigations, latency of 10–30 seconds per transaction is acceptable if accuracy improves.

For manufacturing quality control on assembly lines, latency under 2 seconds is mandatory. Use optimized deployment or smaller models.

Accuracy

Larger models generally provide better accuracy, reasoning, and instruction following. The quality gap between GPT-4 and smaller models is measurable.

For high-stakes decisions (healthcare triage, financial fraud, safety-critical systems), pay for the best model. The cost of errors exceeds the cost of model quality.

For lower-stakes tasks (product recommendations, simple customer service, content categorization), smaller models often suffice.

Test rigorously with production data to measure actual quality differences for your use case. Perception often differs from reality.

Using Multiple Models Together

Production systems often use multiple models for different tasks.

A banking fraud detection system might use:

GPT-4 for complex fraud pattern analysis and investigation reports (high quality, low volume)
GPT-3.5 Turbo for customer notification messages (adequate quality, high volume, lower cost)
Fine-tuned Llama 3 8B for transaction classification (fast, cheap, task-specific)

A healthcare patient triage system might use:

Claude 3.5 for complex symptom analysis requiring long medical context (superior reasoning, long context)
GPT-4 for patient communication (natural, empathetic responses)
Specialized medical model for specific clinical decision support

A retail system might use:

GPT-4 for customer service conversations requiring reasoning
Embedding models for product search and recommendations
Fine-tuned classifier for simple routing decisions

Route requests to the appropriate model based on task requirements. Use cheap, fast models for simple tasks. Use expensive, powerful models for complex tasks. Measure cost and quality for your specific workload.
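A router for this pattern can start as a few explicit rules. The sketch below is illustrative: the `Task` fields, the 100,000-token cutoff, and the tier labels are all hypothetical, and the labels are deliberately generic rather than real model names so they can be mapped to whatever models a deployment uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str              # e.g. "classification", "conversation", "report"
    needs_reasoning: bool
    input_tokens: int


def pick_model_tier(task: Task) -> str:
    """Choose a model tier from simple, explicit rules."""
    if task.kind == "classification" and not task.needs_reasoning:
        return "small-finetuned"   # fast, cheap, task-specific
    if task.input_tokens > 100_000:
        return "long-context"      # large documents or long histories
    if task.needs_reasoning:
        return "frontier"          # highest quality, highest cost
    return "mid-tier"              # adequate quality for high volume


routing = [
    pick_model_tier(Task("classification", False, 300)),
    pick_model_tier(Task("report", True, 150_000)),
    pick_model_tier(Task("conversation", True, 2_000)),
    pick_model_tier(Task("notification", False, 200)),
]
```

Explicit rules like these are easy to audit and to adjust once cost and quality measurements from the real workload are available.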

Multi-Industry Model Selection Examples

Banking Fraud Detection

Primary: Claude 3.5 (200K context enables analyzing complete transaction history with regulatory guidelines)
Secondary: GPT-4 Turbo for customer communications
Reasoning: Fraud investigation requires deep analysis of patterns across long histories. Regulatory documents must be considered. Claude’s long context and strong reasoning justify the choice.

Retail Inventory Optimization

Primary: GPT-4 Turbo (good reasoning for demand forecasting)
Secondary: Llama 3 70B for real-time pricing decisions (self-hosted for volume)
Reasoning: Strategic planning requires quality reasoning. Real-time pricing decisions involve high volume where self-hosting becomes cost-effective.

Healthcare Patient Triage

Primary: GPT-4 (proven reliability for safety-critical decisions)
Fallback: Claude 3.5 for complex cases requiring long medical history analysis
Secondary: Fine-tuned domain model for routine classification
Reasoning: Patient safety demands highest quality for complex cases. Routine cases can use cheaper specialized models. Long medical histories benefit from Claude’s context window.

Manufacturing Quality Control

Primary: GPT-4 Vision for image analysis
Secondary: Llama 3 8B fine-tuned for defect classification (fast, on-premise)
Reasoning: Real-time assembly line inspection requires low latency. Fine-tuned small model handles routine cases. GPT-4 Vision handles complex visual analysis for escalated cases.

Agriculture Crop Monitoring

Primary: Llama 3 70B (self-hosted close to the field; constrained edge hardware may need a smaller variant such as Llama 3 8B)
Secondary: GPT-4 for complex crop disease diagnosis requiring visual analysis
Reasoning: Intermittent connectivity and remote locations favor self-hosted deployment. Most routine monitoring uses local model. Complex cases connect to cloud for higher quality analysis.

Aviation Maintenance

Primary: GPT-4 (safety-critical decisions require highest quality)
Backup: Claude 3.5 for analyzing maintenance manuals and regulations
Reasoning: Aviation safety is non-negotiable. Use best available models. Regulatory compliance requires analyzing long documents. Redundancy and human oversight mandatory.

Model selection must account for your specific requirements: accuracy needs, latency constraints, volume economics, data privacy requirements, deployment environment, and risk tolerance.

Putting It All Together

The foundation architecture determines everything that follows. Define purpose clearly with measurable success criteria. Design prompts that ensure consistent, safe behavior. Select models that balance quality, cost, and latency for your specific use case.

These three steps represent 60% of your agent’s success. Skip them or rush through them, and you will build systems that fail in production. Invest time getting them right, and the remaining architectural components become straightforward.

In Part 2, we will cover infrastructure: tools and integrations, memory systems, and orchestration patterns. You will learn how to give your agent the capabilities, knowledge, and coordination logic it needs to operate autonomously.

We will explore tool architecture patterns that enable real-world actions. Memory systems that provide context and learning. Orchestration workflows that handle complex multi-step processes reliably.

Your engagement helps these guides reach practitioners who need them. If you found this valuable, please clap, comment with your industry experience, and share with your network. What challenges are you facing with agent design? What use cases are you building? Drop a comment below.

Follow me for Part 2, where we dive into the infrastructure layer that transforms a well-designed agent into a production system that delivers real business value.

Building AI Agents Part 1: Defining Purpose, Designing Prompts, and Selecting Models was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
