AI Deep Dive

The AI Hierarchy — What Fits Where

Every AI buzzword maps to a specific layer in a hierarchy. Understanding this hierarchy is the single most important thing before you walk into any AI conference.

ARTIFICIAL INTELLIGENCE — Machines that mimic human cognition
  MACHINE LEARNING — Systems that learn from data, not rules
    DEEP LEARNING — Neural networks with many layers
      LLMs — GPT, Claude, Gemini, Llama
      Vision Models — Image generation, recognition
      Speech Models — Whisper, TTS, ASR
AI AGENTS — LLMs + Tools + Memory + Autonomy (the Salesforce focus)

Key Distinction: AI vs ML vs Deep Learning

| Concept | What It Is | Example |
|---|---|---|
| AI | Any system that performs tasks typically requiring human intelligence. Includes rule-based systems. | A chess engine with hardcoded rules. An if-else fraud filter. |
| Machine Learning | A subset of AI. The system learns patterns from data instead of being explicitly programmed. | Spam filter that learns from labeled emails. Recommendation engines. |
| Deep Learning | A subset of ML using neural networks with many layers. Excels at unstructured data (text, images, audio). | ChatGPT, image recognition, voice assistants. |
| Generative AI | A subset of DL that creates new content — text, images, code, audio. | Claude writing an email. DALL-E generating an image. |

CTO mental model: All generative AI is deep learning. All deep learning is ML. All ML is AI. But not all AI is ML — rule-based expert systems are AI but not ML.

Types of Machine Learning

📊
Supervised Learning

Learns from labeled examples. Input → known output. Used for: classification (spam/not spam), regression (price prediction), forecasting.

🔍
Unsupervised Learning

Finds patterns in unlabeled data. No "right answer" given. Used for: customer segmentation, anomaly detection, clustering.

🎮
Reinforcement Learning

Agent learns by trial-and-error, maximizing a reward signal. Used for: game AI, robotics, ad bidding, RLHF for LLMs.

🔄
Self-Supervised Learning

The model creates its own labels from the data. "Predict the next word." This is how LLMs like Claude are pre-trained.

How AI Actually Learns — From Data to Intelligence

Neural Networks: The Core Mechanism

A neural network is a function that takes input numbers and produces output numbers, with adjustable parameters (called weights) in between. "Learning" means adjusting those weights to minimize errors.

The Training Loop (every AI model follows this)

  1. Forward pass: Feed input data through the network. It produces a prediction.
  2. Loss calculation: Compare the prediction to the correct answer. Measure how wrong it was (the "loss").
  3. Backpropagation: Calculate how each weight contributed to the error.
  4. Weight update: Adjust weights slightly to reduce the error (using gradient descent).
  5. Repeat: Do this billions of times across the entire dataset. Each full pass = one "epoch."
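
The loop above can be sketched in a few lines of plain Python. This is a toy with a single weight fitting y = 2x, not a real neural network, but every numbered step appears:

```python
# Minimal training-loop sketch: one adjustable weight, gradient descent on
# squared error. A toy illustration, not a real neural network.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, correct answer)
w = 0.0    # the single adjustable weight
lr = 0.05  # learning rate: how big each weight update is

for epoch in range(200):                  # 5. repeat across the dataset
    for x, y_true in data:
        y_pred = w * x                    # 1. forward pass
        loss = (y_pred - y_true) ** 2     # 2. loss calculation
        grad = 2 * (y_pred - y_true) * x  # 3. backpropagation (dloss/dw)
        w -= lr * grad                    # 4. weight update (gradient descent)

print(round(w, 3))  # converges toward 2.0, the true relationship
```

An LLM runs exactly this loop, just with billions of weights instead of one and a loss measuring next-token prediction error.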

What Makes Deep Learning "Deep"

A shallow network has 1-2 hidden layers. A deep network has dozens to hundreds. Each layer learns increasingly abstract features:

Layer 1: Raw patterns (edges, individual characters)
Layers 2-5: Combinations (shapes, words, phrases)
Layers 10-20: Concepts (objects, syntax, meaning)
Layers 50+: Abstract reasoning (intent, context, nuance)

The Transformer Architecture (2017 — the breakthrough)

Before transformers, AI processed text word-by-word sequentially (slow, forgetful). The transformer introduced self-attention: the model can look at all words in a sentence simultaneously and learn which words relate to which.

Why it matters: Every major LLM today — GPT-4, Claude, Gemini, Llama — is a transformer. The 2017 Google paper "Attention Is All You Need" is arguably the most influential AI paper of the decade.

Key Numbers (to have in your back pocket)

| Metric | What It Means | Typical Values |
|---|---|---|
| Parameters | The adjustable weights in the model. More ≈ more capacity to learn. | GPT-4: ~1.8T, Llama 3: 8B-405B, Claude: undisclosed |
| Context Window | How much text the model can "see" at once (input + output). | Claude: 200K tokens. GPT-4: 128K. Gemini: 1M+ |
| Tokens | Chunks of text (~0.75 words per token). The unit of measurement for LLMs. | This entire page ≈ 4,000 tokens |
| Training Data | Total text the model was trained on. | Typically trillions of tokens from books, web, code |
| Inference | Running the trained model to generate a response. What you pay for via API. | ~$3-15 per million input tokens (varies by model) |

Large Language Models — The Engine Behind Everything

What an LLM Actually Does

An LLM is a next-token predictor. Given a sequence of tokens, it predicts the probability distribution over all possible next tokens, then samples from that distribution. That's it. All the apparent "intelligence" emerges from doing this prediction extremely well over extremely large amounts of data.
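
A minimal sketch of that prediction step, with made-up logits. A real LLM produces one score per token in a vocabulary of roughly 100K, but the mechanics are the same:

```python
import math
import random

# Next-token prediction in miniature: raw scores -> probabilities -> sample.
# The vocabulary and logits here are invented for illustration.

vocab = ["mat", "roof", "moon", "banana"]
logits = [3.2, 2.1, 0.5, -1.0]  # model's raw scores for "The cat sat on the ___"

# Softmax: turn raw scores into probabilities that sum to 1.
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Sample one token according to the distribution.
random.seed(0)  # fixed seed so the example is repeatable
next_token = random.choices(vocab, weights=probs, k=1)[0]
print(next_token, [round(p, 3) for p in probs])
```

Generation is just this step repeated: append the sampled token to the sequence and predict again.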

The Three Phases of Building an LLM

Phase 1: Pre-training
Phase 2: Fine-tuning
Phase 3: RLHF / Alignment

Phase 1 — Pre-training (costs $10M-$100M+): Feed the model trillions of tokens of text from the internet, books, code. The model learns language, facts, reasoning patterns. This produces a "base model" — it can complete text but won't follow instructions.

Phase 2 — Fine-tuning (SFT): Train on curated instruction/response pairs. "When a human asks X, a good response looks like Y." This teaches the model to be a helpful assistant instead of just an autocomplete engine.

Phase 3 — RLHF (Reinforcement Learning from Human Feedback): Human raters rank multiple model responses. A reward model learns what humans prefer. The LLM is then trained to maximize that reward. This is what makes Claude polite, safe, and genuinely useful vs. just technically capable.

Key LLM Capabilities

In-Context Learning

Give the LLM examples in the prompt, it adapts behavior without retraining. "Few-shot" prompting.
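
Few-shot prompting is plain string assembly. A sketch with an invented sentiment-labeling task:

```python
# Few-shot ("in-context") prompting: worked examples go directly in the
# prompt text. Task and examples here are invented for illustration.

examples = [
    ("The product arrived broken.", "negative"),
    ("Setup took two minutes, love it.", "positive"),
]
query = "Support never answered my email."

parts = ["Label each review as positive or negative.\n"]
for text, label in examples:              # the "shots"
    parts.append(f"Review: {text}\nLabel: {label}\n")
parts.append(f"Review: {query}\nLabel:")  # the model completes this line
prompt = "\n".join(parts)
print(prompt)
```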

Chain of Thought

Ask it to "think step by step" and accuracy on reasoning tasks jumps dramatically.

Tool Use / Function Calling

The LLM outputs structured JSON to call external APIs, databases, or tools. This is the foundation of agents.
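
A sketch of the mechanics, with an invented tool name and schema. The model's JSON output is simulated as a literal string here; in production it comes back from the LLM API:

```python
import json

# Function calling in miniature: the LLM emits structured JSON naming a
# tool and its arguments; your code executes the call and returns the result.

def get_order_status(order_id: str) -> str:
    # Stand-in for a real API or database call.
    return f"Order {order_id}: shipped"

TOOLS = {"get_order_status": get_order_status}

# What a model's tool-call output might look like (simulated):
llm_output = '{"tool": "get_order_status", "arguments": {"order_id": "A123"}}'

call = json.loads(llm_output)
result = TOOLS[call["tool"]](**call["arguments"])
print(result)
```

The result is then fed back to the LLM as context so it can compose its final answer, which is exactly the agent loop described later.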

RAG (Retrieval-Augmented Generation)

Before answering, retrieve relevant documents from a database and inject them into the context. Reduces hallucination. Keeps answers grounded in your data.

What LLMs Cannot Do (important for a CTO)

No true memory: Each conversation starts fresh unless you engineer persistence.

Hallucinations: They confidently state false things. This is inherent to probabilistic generation. Mitigation = RAG, grounding, verification layers.

No real-time data: Knowledge is frozen at training cutoff unless you add search/retrieval tools.

Math and precise logic: Unreliable for complex calculations without tool use. They approximate; they don't compute.

Determinism: Same input can produce different outputs. Temperature controls randomness but never eliminates it.
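
Temperature is a divisor applied to the logits before the softmax. A sketch showing how it sharpens or flattens the distribution (logits are made up):

```python
import math

# Temperature rescales logits before softmax: low temperature sharpens the
# distribution toward the top token, high temperature flattens it.

def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    exps = [math.exp(z) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Even at temperature near 0 the rest of the serving stack (batching, hardware nondeterminism) can still vary outputs, which is why "same input, same output" is never fully guaranteed.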

The AI Provider Landscape

Foundation Model Providers

| Provider | Models | Strengths | Access |
|---|---|---|---|
| Anthropic | Claude (Opus, Sonnet, Haiku) | Safety, long context (200K), instruction following, coding, analysis | API, claude.ai, AWS Bedrock, GCP Vertex |
| OpenAI | GPT-4o, o1, o3 | Broad capabilities, vision, ecosystem, first-mover brand | API, ChatGPT, Azure OpenAI |
| Google | Gemini (Ultra, Pro, Flash) | Multimodal, huge context (1M+), integrated with Google Cloud | API, Gemini app, GCP Vertex |
| Meta | Llama 3/4 | Open-source, self-hostable, fine-tunable, no vendor lock-in | Download weights, run anywhere |
| Mistral | Mixtral, Mistral Large | European, efficient, open-weight options | API, self-host |

How Claude is Different

Anthropic's approach centers on "Constitutional AI" — the model is trained with a set of principles (a "constitution") rather than relying purely on human labeling. This produces more consistent, principled behavior.

Practical differences: Claude tends to follow complex instructions more precisely, handles very long documents well (200K context), and is more cautious about harmful outputs. For enterprise use: Claude is available on AWS Bedrock and GCP Vertex, so you can keep data within your cloud perimeter.

Open Source vs Closed Source — CTO Decision Framework

| Factor | Closed (Claude, GPT-4) | Open (Llama, Mistral) |
|---|---|---|
| Performance | Generally best-in-class | Closing the gap rapidly |
| Cost | Pay per token (API) | Infra cost (GPUs) — can be cheaper at scale |
| Data privacy | Data sent to provider's API (though enterprise tiers offer zero retention) | Runs on your infra — full control |
| Customization | Prompt engineering, some fine-tuning | Full fine-tuning, modify architecture |
| Maintenance | Provider handles everything | You own ops, updates, security |
| Best for | Fast deployment, best quality, small-medium scale | High volume, strict compliance, niche domains |

AI Agents — The Salesforce Conference Focus

What is an AI Agent?

An AI agent is an LLM that can plan, use tools, observe results, and iterate — autonomously. Instead of just answering a question, it takes action to accomplish a goal.

Diagram: an LLM "Brain" (reasons, plans, decides) sits at the center, triggered by a 👤 user or event, and connects out to 🔧 Tools / APIs, 🗄️ Databases, 🌐 Web / Search, 📧 Email / CRM, a 🧠 Memory Store, and 📋 Planning / Goals.

Agent vs Automation vs Chatbot — The Critical Distinction

| Feature | Traditional Automation (RPA) | Chatbot (rule-based) | AI Chatbot (LLM) | AI Agent |
|---|---|---|---|---|
| Decision making | None. Follows fixed rules. | Decision tree only | Flexible, but single-turn | Plans multi-step, adapts |
| Handles ambiguity | No — breaks on edge cases | No | Yes | Yes |
| Uses tools | Hardcoded integrations | No | If programmed to | Autonomously decides which tools |
| Memory | None | Session only | Session only | Short + long-term memory |
| Autonomy | Zero | Zero | Low | High — can loop, retry, escalate |
| Example | Copy data between systems on schedule | "Press 1 for billing" | "Summarize this ticket" | "Resolve this ticket: read it, check the database, update the CRM, email the customer" |

Salesforce context: When Salesforce says "Agentforce," they mean AI agents embedded into Salesforce workflows — agents that can read your CRM data, take actions (create cases, send emails, update records), and operate within the guardrails you define. These sit on top of their Einstein AI platform + Data Cloud.

The Agent Loop (ReAct Pattern)

Most agent frameworks follow this loop:

  1. Observe: Receive user request or trigger event. Retrieve relevant context from memory.
  2. Think: The LLM reasons about what to do next. It creates a plan or picks the next action.
  3. Act: Call a tool — query a database, call an API, send a message, update a record.
  4. Observe: Check the result of the action. Did it work? Was the data correct?
  5. Loop or stop: If the goal is met, respond to the user. If not, go back to step 2 with updated context.
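
The loop above as a runnable skeleton. The decide() function stands in for the LLM and is hardcoded here; the ticket tool, its data, and the ticket ID are all invented:

```python
# ReAct-style agent loop skeleton with one stubbed tool and a stubbed "LLM".

def lookup_ticket(ticket_id):
    # A tool the agent can call; a real one would query your ticketing system.
    return {"id": ticket_id, "status": "open", "topic": "billing"}

TOOLS = {"lookup_ticket": lookup_ticket}

def decide(goal, observations):
    # Stand-in for the LLM "Think" step. A real agent sends the goal plus
    # observations to an LLM and parses its chosen action out of the reply.
    if not observations:
        return ("act", "lookup_ticket", "T-42")
    last = observations[-1]
    return ("stop", f"Ticket {last['id']} is {last['status']} ({last['topic']})")

def run_agent(goal, max_steps=5):
    observations = []                                # 1. Observe: context so far
    for _ in range(max_steps):
        decision = decide(goal, observations)        # 2. Think
        if decision[0] == "stop":
            return decision[1]                       # 5. Goal met: respond
        _, tool_name, arg = decision                 # 3. Act: call the tool
        observations.append(TOOLS[tool_name](arg))   # 4. Observe the result
    return "Step limit reached; escalate to a human"

print(run_agent("Summarize ticket T-42"))
```

Note the max_steps cap: a hard iteration limit is the simplest defense against runaway agent loops (and runaway token bills).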

Agent Architecture Components

🧠
LLM Core

The reasoning engine. Chooses actions, interprets results, generates responses. Claude, GPT-4, etc.

🔧
Tools

Functions the agent can call: APIs, database queries, web search, calculators, file operations. Defined as schemas the LLM can invoke.

💾
Memory

Short-term: Conversation context. Long-term: Vector database storing past interactions, user preferences, knowledge base.

🛡️
Guardrails

Rules that constrain what the agent can do. "Never delete records." "Always get approval for refunds > $500." "Escalate to human if confidence is low."
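
Guardrails are often just deterministic policy checks that run before any agent action executes. A sketch mirroring the rules above; the action shape, field names, and threshold are illustrative:

```python
# Guardrail layer sketch: policy checks applied to a proposed agent action
# before it runs. All rule details are illustrative assumptions.

APPROVAL_THRESHOLD = 500  # refunds above this require a human

def check_action(action: dict) -> str:
    if action["type"] == "delete_record":
        return "blocked: agents may never delete records"
    if action["type"] == "refund" and action["amount"] > APPROVAL_THRESHOLD:
        return "needs_approval: refund exceeds $500"
    if action.get("confidence", 1.0) < 0.6:
        return "escalate: low confidence, route to human"
    return "allowed"

print(check_action({"type": "refund", "amount": 800}))
print(check_action({"type": "refund", "amount": 50}))
```

The key design point: guardrails live outside the LLM, in ordinary code, so a clever prompt cannot talk the system out of them.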

📊
Orchestrator

Manages the agent loop. Handles retries, timeouts, error handling, and routing between multiple agents.

👁️
Observability

Logging every step: what the agent thought, what tools it called, what it returned. Critical for debugging and compliance.

Building AI Applications — Practical Architecture

The AI Application Stack

User Interface — Chat, voice, embedded UI, API endpoints
Application Layer — Business logic, auth, rate limiting, caching
Orchestration — Agent framework, prompt management, tool routing
RAG Pipeline — Document ingestion, embeddings, vector search, reranking
Model Layer — LLM API calls (Claude, GPT-4, etc.) or self-hosted models
Infrastructure — GPUs, vector DB, object storage, monitoring

RAG: The Most Common Enterprise AI Pattern

RAG (Retrieval-Augmented Generation) is how you make an LLM answer questions about your data without retraining the model.

How RAG Works — Step by Step

  1. Ingest documents: Take your internal docs (PDFs, Confluence pages, Slack messages, CRM data). Split them into chunks (typically 200-500 tokens each).
  2. Create embeddings: Run each chunk through an embedding model (e.g., OpenAI text-embedding-3, Cohere Embed). This converts text to a numerical vector (a list of ~1500 numbers) that captures semantic meaning.
  3. Store in vector database: Store those vectors in a vector DB (Pinecone, Weaviate, pgvector, Qdrant, Chroma). This enables fast similarity search.
  4. At query time: Convert the user's question to a vector using the same embedding model.
  5. Search: Find the top 5-20 most similar document chunks in your vector DB.
  6. Augment: Insert those chunks into the LLM prompt as context: "Given this information: [chunks], answer the user's question: [question]."
  7. Generate: The LLM answers based on the retrieved context, dramatically reducing hallucination.
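
An end-to-end toy version of those steps in pure Python. The "embedding model" here is just word counts and the "vector DB" a list; real systems use a learned embedding model and a proper store, but the flow is identical. The documents and question are invented:

```python
import math
from collections import Counter

# Toy RAG: embed (word counts), index, search by cosine similarity,
# then assemble the augmented prompt. All data here is invented.

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Premium support is available 24/7 by phone.",
]

def embed(text):
    # Step 2 stand-in: text -> vector. Real systems call an embedding model.
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

index = [(doc, embed(doc)) for doc in docs]        # step 3: the "vector DB"

question = "How long do refunds take?"
q_vec = embed(question)                            # step 4: embed the query
top = max(index, key=lambda item: cosine(q_vec, item[1]))  # step 5: search

# Step 6: augment the prompt with the retrieved chunk; step 7 sends it to the LLM.
prompt = f"Given this information: {top[0]}\nAnswer the question: {question}"
print(prompt)
```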

RAG pitfall: "Garbage in, garbage out" applies fully. If your documents are messy, poorly structured, or out of date, RAG will confidently retrieve bad information. Expect data quality work to be the majority of a RAG project.

How IT Vendors / SaaS Companies Build AI Into Their Products

When Salesforce, ServiceNow, SAP, or any SaaS vendor says "we have AI," here's what they typically do:

Approach 1: Embed an LLM via API

The vendor calls Claude/GPT-4 via API and wraps it with their product context. Your CRM data is injected into prompts. The vendor manages prompt engineering, guardrails, and tool definitions. Example: Salesforce Einstein GPT uses a combination of proprietary models and partner LLMs.

Approach 2: Fine-tune on Domain Data

Take a base model, fine-tune it on industry-specific data (healthcare records, legal contracts, financial reports). This creates a specialized model that understands domain jargon and patterns. Example: Bloomberg trained BloombergGPT on financial data.

Approach 3: Build a RAG Layer Over Customer Data

The vendor builds a RAG pipeline that indexes each customer's data (behind proper isolation/permissions). When the AI answers, it's grounded in that customer's specific data. Example: Salesforce Data Cloud feeds into Einstein AI as a retrieval layer.

Approach 4: Train Custom Models

Major vendors train their own models from scratch for specific tasks — fraud detection, recommendation engines, demand forecasting. These are typically smaller, specialized models, not general-purpose LLMs.

Practical: Building an AI Feature (Simplified)

Say you want to build: "An AI that answers customer questions using your knowledge base."

| Step | What You Do | Tools/Services |
|---|---|---|
| 1. Data prep | Export knowledge base articles, clean HTML, split into chunks | Python, LangChain text splitters, Unstructured.io |
| 2. Embeddings | Generate vector embeddings for each chunk | OpenAI Embeddings API, Cohere, Voyage AI |
| 3. Vector store | Store embeddings with metadata (article ID, title, date) | Pinecone, Weaviate, pgvector (Postgres), Qdrant |
| 4. Retrieval | Build search endpoint: query → vector → top-k similar chunks | Vector DB SDK + reranking (Cohere Rerank) |
| 5. Prompt | System prompt with role, rules, tone + injected context chunks | Prompt templating (LangChain, custom) |
| 6. LLM call | Send assembled prompt to Claude/GPT-4 API | Anthropic API, OpenAI API |
| 7. UI | Chat interface with streaming responses | React, Vercel AI SDK, Streamlit |
| 8. Guardrails | Input validation, output filtering, content policy enforcement | Custom rules, Guardrails AI, Anthropic's built-in safety |
| 9. Observability | Log every request, response, retrieval, latency, cost | LangSmith, Langfuse, Helicone, Datadog |
| 10. Evaluation | Measure answer quality, hallucination rate, user satisfaction | Human review, LLM-as-judge, RAGAS framework |

Key Frameworks & Tools You'll Hear About

LangChain

Python/JS framework for building LLM applications. Chains together prompts, tools, memory, retrievers. Widely used but can be over-abstracted for simple use cases.

LlamaIndex

Focused specifically on RAG. Better than LangChain for document indexing and retrieval pipelines. Good for "connect your data to LLMs."

CrewAI / AutoGen

Multi-agent frameworks. Define multiple agents with different roles that collaborate on complex tasks. Still experimental for production.

Vercel AI SDK

Lightweight SDK for building AI chat UIs in Next.js/React. Handles streaming, tool calls, multi-model support. Good for frontend devs.

Voice AI — Calls, IVR, and Conversational Agents

How Voice AI Works End-to-End

🎤 User speaks → ASR/STT → LLM processes text → TTS → 🔊 User hears response
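
The pipeline as code, with every stage stubbed out. Real systems stream audio through hosted ASR, LLM, and TTS services; these functions are placeholders that only show the data flow:

```python
# Voice AI pipeline skeleton: audio in -> text -> text -> audio out.
# All three stages are stubs; the transcript and reply are invented.

def speech_to_text(audio: bytes) -> str:
    return "what time do you close today"   # pretend ASR output

def llm_reply(text: str) -> str:
    return "We close at 6 pm today."        # pretend LLM output

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")             # pretend synthesized audio

def handle_call(audio: bytes) -> bytes:
    transcript = speech_to_text(audio)      # ASR/STT
    reply = llm_reply(transcript)           # LLM processes text
    return text_to_speech(reply)            # TTS back to the caller

print(handle_call(b"...raw audio..."))
```

In production each stage streams partial results to the next rather than waiting for completion, which is how sub-second latency targets are met.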

ASR (Automatic Speech Recognition) / STT (Speech-to-Text): Converts spoken audio to text. Leading models: OpenAI Whisper (open source, very good), Google Cloud Speech, Deepgram (low latency, real-time), AssemblyAI.

LLM Processing: Same as any text-based AI. The transcribed text is the input. The LLM generates a text response.

TTS (Text-to-Speech): Converts the LLM's text response back to natural-sounding speech. Leading options: ElevenLabs (most natural), OpenAI TTS, Google Cloud TTS, Play.ht, Cartesia (ultra-low latency).

Voice AI Use Cases

Welcome / Outbound Calls

AI calls new customers to welcome them, walk through onboarding, collect preferences. Runs 24/7. Companies like Bland.ai and Retell AI provide turn-key platforms.

Customer Support IVR

Replace "Press 1 for billing" with natural conversation. AI understands intent, pulls up account data, resolves issues or routes to human. Huge cost savings.

Appointment Scheduling

AI calls to confirm/reschedule appointments. Handles back-and-forth negotiation on times. Used in healthcare, salons, auto services.

Sales Qualification

AI calls inbound leads, asks qualifying questions, logs data to CRM, books meetings for human reps. Example: Air AI, Vapi.

Building a Voice AI System — Key Decisions

| Decision | Options | Trade-off |
|---|---|---|
| Latency target | <500ms feels natural, >1s feels robotic | Lower latency = more expensive, requires streaming ASR+TTS |
| Build vs buy | Platforms: Vapi, Retell, Bland.ai, Vocode. Build: Twilio + ASR + LLM + TTS | Platforms are faster to ship. Custom gives control over every component. |
| Interruption handling | Must detect when user starts speaking mid-response and stop gracefully | Hard to get right. Platforms handle this. DIY requires VAD (Voice Activity Detection). |
| Phone integration | Twilio, Vonage, Plivo for SIP/PSTN | Twilio is most mature. Costs per minute apply. |

Enterprise AI — Practical Considerations

Data Privacy & Security

Zero Data Retention (ZDR): Enterprise API tiers from Anthropic and OpenAI guarantee your data is not used for training and is not retained after the request. Verify this in your contract.

Data residency: Run Claude via AWS Bedrock in your preferred AWS region. Data never leaves your VPC. Same with Google Vertex AI.

PII handling: Either redact PII before sending to the LLM, or use enterprise tiers with appropriate data processing agreements.

SOC2 / HIPAA / GDPR: Check provider compliance certifications. Anthropic and OpenAI have SOC2 Type II. HIPAA requires a signed BAA (Business Associate Agreement).

Cost Management

LLM API Pricing Model

You pay per token — both input (prompt) and output (response). A 2000-word document as input + a 500-word response ≈ 3500 tokens ≈ $0.01-0.05 depending on model.
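
A back-of-envelope version of that arithmetic. The per-token prices are assumptions, roughly in line with the $3-15 per million input tokens cited earlier; check current rates for the model you actually use:

```python
# Rough cost model for one LLM request. Prices are illustrative placeholders.

words_in, words_out = 2000, 500
tokens_in = words_in / 0.75           # ~0.75 words per token
tokens_out = words_out / 0.75

price_in = 3.00 / 1_000_000           # $ per input token (assumed $3/M)
price_out = 15.00 / 1_000_000         # $ per output token (assumed $15/M)

cost = tokens_in * price_in + tokens_out * price_out
print(f"~{tokens_in + tokens_out:.0f} tokens, ${cost:.4f} per request")
```

Multiply by expected daily request volume before launch; per-request costs look trivial until they meet production traffic.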

Cost Optimization Strategies

Model routing: Use a cheap/fast model (Claude Haiku, GPT-4o mini) for simple queries. Route complex queries to the expensive model (Claude Opus, GPT-4). This alone can cut costs 60-80%.
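
A sketch of a routing layer. The heuristic, keywords, and model names are illustrative; production routers often use a small classifier model rather than keyword rules:

```python
# Model routing sketch: pick a model tier per query with a cheap heuristic.
# Tier names and the routing rule are illustrative assumptions.

CHEAP, EXPENSIVE = "claude-haiku", "claude-opus"

def route(query: str) -> str:
    hard_signals = ["analyze", "compare", "multi-step", "write code"]
    if len(query) > 500 or any(s in query.lower() for s in hard_signals):
        return EXPENSIVE
    return CHEAP

print(route("What are your opening hours?"))                  # routes to the cheap tier
print(route("Analyze these contracts and compare clauses"))   # routes to the expensive tier
```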

Caching: Cache frequent queries. Anthropic offers prompt caching — if the same system prompt is reused, you pay a fraction of the cost.

Shorter prompts: Every token in your system prompt is charged on every request. Optimize for brevity.

Batch processing: For non-real-time tasks, use batch APIs at 50% discount.

Evaluation — How to Know If Your AI Is Good

Accuracy / Correctness

Are the answers factually correct? Measure with human review + automated checks against known answers.

Hallucination Rate

How often does it make things up? Use "faithfulness" metrics — does the answer only contain claims supported by the retrieved context?

Latency

Time-to-first-token and total response time. Users expect <2s for first token in chat UIs.

Cost per Query

Track input tokens, output tokens, embedding calls, vector DB queries per request. Set budgets.

Team Structure for AI

| Role | What They Do | Typical Background |
|---|---|---|
| AI/ML Engineer | Builds pipelines, integrates LLMs, manages RAG, fine-tuning | Software engineer + ML experience |
| Prompt Engineer | Designs and optimizes system prompts, evaluates output quality | Domain expert + writing skill (often non-technical) |
| Data Engineer | Prepares, cleans, and pipelines data for RAG / training | Data engineering, ETL, data quality |
| Platform/Infra | Manages GPUs, vector DBs, model serving, observability | DevOps / SRE with ML infra experience |

Salesforce AI Ecosystem — Conference Context

Key terms you'll hear at the conference: Agentforce, Einstein GPT, Data Cloud, Prompt Builder, Model Builder, Trust Layer, Einstein Copilot.

Salesforce AI Architecture

Agentforce — Pre-built + custom AI agents for sales, service, marketing
Einstein Copilot — Conversational AI assistant inside Salesforce UI
Einstein GPT + Prompt Builder — Customizable AI generation in flows
Trust Layer — Prompt defense, toxicity filtering, PII masking, audit logging
Data Cloud — Unified customer data, real-time, feeds into AI as context
Foundation Models — Salesforce's own + OpenAI, Anthropic, Google, Cohere via gateway

What Salesforce "Agentforce" Actually Is

Agentforce = a platform for building and deploying AI agents inside Salesforce. Think of it as:

LLM (multi-model — they route to different providers)
+ Salesforce data (CRM, Data Cloud, Knowledge Base)
+ Tools (Salesforce actions: create case, update opportunity, send email)
+ Guardrails (Trust Layer + your business rules)
+ Deployment (embed in Service Cloud, Sales Cloud, web, Slack, etc.)

You can build agents using low-code tools (Agent Builder) or code (Apex, LWC). Salesforce provides pre-built agent templates for: Service Agent, Sales Coach, Merchant Agent, Buyer Agent, Campaign Agent.

What Questions to Ask at the Conference

• "How does Agentforce handle multi-step tool failures and retries?"

• "What's the latency overhead of the Trust Layer on each LLM call?"

• "Can I bring my own model (BYOM) and still use the orchestration layer?"

• "How does Data Cloud grounding work — is it RAG under the hood? What embedding model?"

• "What's the pricing model — per agent, per conversation, per action?"

• "How do I evaluate agent quality? Is there built-in testing/evaluation tooling?"

• "What observability do I get — can I see every reasoning step, tool call, and retrieval?"

Risks, Limitations & What Can Go Wrong

Hallucination

LLMs generate plausible-sounding false information. Mitigation: RAG, citations, confidence scoring, human-in-the-loop for high-stakes decisions.

Prompt Injection

Malicious users craft inputs that override system instructions. "Ignore previous instructions and..." Mitigation: input sanitization, separate system/user contexts, output validation.

Data Leakage

Model accidentally reveals training data or other users' data. Mitigation: data isolation, output filtering, PII redaction, tenant separation.

Bias

Models inherit biases from training data. Can discriminate in hiring, lending, customer service. Mitigation: bias testing, diverse training data, human oversight.

Cost Overruns

Token costs can spike unexpectedly. One rogue loop in an agent can burn through budget. Mitigation: per-request budgets, circuit breakers, cost monitoring.

Vendor Lock-in

Building on one provider's API creates dependency. Mitigation: abstract the model layer, use frameworks that support model switching.

CTO Strategy — Where to Start

The Pragmatic Adoption Ladder

  1. Internal productivity (low risk, high value): Deploy an LLM chatbot connected to your internal knowledge base. Let employees ask questions about policies, docs, procedures. This is the quickest win with the least risk.
  2. Customer-facing copilot (medium risk): AI that helps customers with common questions, grounded in your help center. Always with a "talk to human" escape hatch.
  3. Process automation agents (higher risk): Agents that take actions — update records, send emails, process refunds. Requires guardrails, approval workflows, thorough testing.
  4. Autonomous agents (highest complexity): Multi-step agents that handle end-to-end workflows with minimal human oversight. Only after you have robust evaluation, monitoring, and rollback capabilities.

The #1 mistake CTOs make: Starting with a model choice instead of starting with the problem. Pick a specific, measurable business problem first. Then figure out which AI approach solves it. Often you don't need the most expensive model — or any model at all.

Quick Reference: The AI Glossary

| Term | Plain English |
|---|---|
| Token | A chunk of text (~¾ of a word). The unit LLMs process and bill by. |
| Embedding | Converting text to a list of numbers that capture meaning. Similar texts → similar numbers. |
| Vector Database | A database optimized for storing and searching embeddings by similarity. |
| RAG | Retrieval-Augmented Generation. Look up relevant docs, then feed them to the LLM. |
| Fine-tuning | Additional training on specific data to specialize a model. Expensive, usually unnecessary. |
| Prompt Engineering | Crafting the instructions (system prompt) to get the best output from an LLM. |
| Temperature | Controls randomness. 0 = deterministic, 1 = creative. Use low for facts, higher for brainstorming. |
| Context Window | Max text the model can process at once. Bigger = more info per request, but more expensive. |
| RLHF | Reinforcement Learning from Human Feedback. How models learn to be helpful vs. just technically correct. |
| Inference | Running a trained model to get a prediction/response. The thing you pay for in production. |
| Hallucination | When the model confidently outputs false information. |
| Guardrails | Rules that constrain AI behavior — what it can/can't do, say, or access. |
| MCP | Model Context Protocol. An open standard (by Anthropic) for connecting LLMs to external tools and data sources. |
| Function Calling | The LLM outputs structured data to invoke external tools/APIs. The mechanism behind agents. |
| Agentic | AI that can plan, act, observe, and loop — not just respond to a single prompt. |