AIPython

LangSmith: Your Complete Guide to Debugging, Testing, and Evaluating LLM Applications

3/12/2026
10 min read

Introduction

Building production-grade LLM applications is fundamentally different from training models in research settings. When you deploy a language model into the wild—whether it's powering a customer support chatbot, a code generation tool, or a multi-step reasoning agent—you encounter a new class of problems: visibility, reproducibility, and evaluation.

Imagine you've built an AI system that handles customer inquiries. A customer reports that the system gave a nonsensical response, but you can't reproduce the issue. Why? Because the LLM's behavior is non-deterministic, the chain of prompts is complex, and you have no visibility into intermediate reasoning steps. This is where LangSmith enters the picture.

LangSmith is a comprehensive observability and debugging platform designed specifically for LLM applications. It's the missing operational layer that transforms your experimental notebooks into reliable, maintainable systems. In this article, we'll explore what LangSmith does, why it's essential for production LLM work, and how to integrate it into your workflow with practical examples.

The Problem: Why Traditional Debugging Doesn't Work for LLM Applications

Before diving into LangSmith, let's understand the unique challenges of debugging LLM-powered systems.

Non-Deterministic Behavior

Traditional software follows deterministic execution paths. Given the same input, a function returns the same output. Not so with LLMs. The same prompt might generate different outputs due to temperature settings, model updates, or sampling variations. This makes reproducing bugs incredibly difficult.
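To make this concrete, here is a toy sketch of temperature sampling. It is a simplified stand-in for what model providers actually do, not any real API: at temperature 0 decoding is greedy and reproducible, while higher temperatures sample from a softmax distribution and can pick different tokens on different runs.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample a token index from temperature-scaled logits."""
    if temperature == 0:
        # Greedy decoding: always pick the highest-scoring token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.5]  # scores for three candidate tokens

# temperature=0 is reproducible across seeds; temperature=1.5 is not
greedy = {sample_token(logits, 0, random.Random(seed)) for seed in range(20)}
varied = {sample_token(logits, 1.5, random.Random(seed)) for seed in range(20)}
print(greedy)  # {0}
print(varied)  # more than one distinct token
```

This is exactly why a bug report of "the bot said something weird" can be impossible to reproduce by rerunning the same prompt.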

Invisible Complexity

A typical LLM application involves multiple layers of abstraction:

  • Prompt templates with variable interpolation
  • Chain-of-thought reasoning spanning multiple LLM calls
  • Retrieval-augmented generation (RAG) with database lookups
  • Tool calls and function execution
  • Output parsing and validation

Each layer can fail silently, making root cause analysis a nightmare.

Evaluation Challenges

How do you measure if your LLM application is "working correctly"? Unlike traditional software with clear pass/fail test cases, LLM outputs require semantic evaluation. Is the response helpful? Is it factually accurate? Is it aligned with the user's intent?
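A quick illustration of why exact string matching falls short: two answers can state the same fact yet fail equality, while a rough token-overlap score (a simple Jaccard measure, used here purely for illustration) still registers the similarity.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Strict comparison: only identical strings pass."""
    return expected.strip().lower() == actual.strip().lower()

def token_overlap(expected: str, actual: str) -> float:
    """Jaccard overlap between the word sets of two answers."""
    a, b = set(expected.lower().split()), set(actual.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

expected = "standard shipping takes 5-7 business days"
actual = "shipping usually takes 5-7 business days"

print(exact_match(expected, actual))              # False
print(round(token_overlap(expected, actual), 2))  # 0.71
```

Neither measure captures factual accuracy or helpfulness, which is why LLM applications ultimately need semantic evaluators like the ones shown later in this article.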

Production Monitoring Blind Spots

Once deployed, how do you know if your system is degrading? What if the model provider releases a new version with different behavior? Without systematic logging and tracing, you're flying blind.

LangSmith addresses all these challenges by providing a structured approach to observation, debugging, and evaluation.

What Is LangSmith?

LangSmith is a platform for instrumenting, debugging, and evaluating LangChain and LangGraph applications (though it can work with any LLM framework). It operates on three core pillars:

  1. Tracing: Automatically capture every step of your LLM application
  2. Testing & Evaluation: Run systematic tests against your chains and compare performance metrics
  3. Monitoring: Track production behavior and catch regressions

Think of LangSmith as the APM (Application Performance Monitoring) tool for LLM applications—similar to how Datadog or New Relic work for traditional web services, but purpose-built for the unique characteristics of LLM systems.

Core Concept 1: Tracing and Observability

What Gets Traced?

LangSmith automatically captures:

  • LLM Calls: Every prompt sent to an LLM and every completion returned
  • Retriever Calls: Vector database lookups and document retrievals
  • Tool Usage: Function calls, API invocations, code execution
  • Parsing Operations: Token parsing, output validation
  • Chain Execution: Branching logic, loops, conditional flows
  • Timing Data: Latency at each step
  • Token Usage: Input/output tokens for cost tracking

This creates a complete execution trace—a visual tree of everything your application did.
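Conceptually, a trace is a tree of timed spans, where each span is one step and nested spans are the work it triggered. The sketch below illustrates the idea only; it is not LangSmith's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an execution trace: an LLM call, retrieval, tool call, ..."""
    name: str
    latency_ms: float = 0.0  # time spent in this step itself
    children: list = field(default_factory=list)

    def total_latency(self) -> float:
        """Latency of this span plus everything nested under it."""
        return self.latency_ms + sum(c.total_latency() for c in self.children)

# One RAG request rendered as a trace tree
trace = Span("rag_pipeline", 5, [
    Span("retrieve_documents", 120),
    Span("llm_call", 850),
])
print(trace.total_latency())  # 975
```

Seeing latency attributed per span is what lets you answer "was it retrieval or generation that was slow?" at a glance.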

Why This Matters

Consider a RAG (Retrieval-Augmented Generation) pipeline that retrieves documents and generates an answer. With tracing, you can answer questions like:

  • Which documents were retrieved? Are they relevant?
  • What prompt was actually sent to the LLM?
  • Where did the latency come from? Retrieval or generation?
  • Did the LLM use the retrieved context effectively?
  • How many tokens were used at each step?

Without tracing, you're guessing. With it, you have data.

Simple Tracing Example

Let's see how to instrument a basic LangChain application with LangSmith:

python
import os
from langsmith import traceable
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Configure LangSmith via environment variables
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-api-key-here"
os.environ["LANGSMITH_PROJECT"] = "my-project"

# Create a traced function
@traceable(name="simple_query")
def answer_question(question: str) -> str:
    """Answer a question using GPT-4."""
    llm = ChatOpenAI(model="gpt-4", temperature=0)
    
    prompt = ChatPromptTemplate.from_template(
        "Answer the following question concisely:\n{question}"
    )
    
    chain = prompt | llm
    response = chain.invoke({"question": question})
    
    return response.content

# Call it—LangSmith will automatically trace the execution
result = answer_question("What is the capital of France?")
print(result)

When you run this code, LangSmith captures:

  • The function entry/exit
  • The LLM call with the exact prompt
  • Token counts and latency
  • The response

You can then view this trace in the LangSmith dashboard.

RAG Pipeline with Tracing

Here's a more complex example with retrieval:

python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_pinecone import PineconeVectorStore
from langsmith import traceable

@traceable(name="rag_pipeline")
def rag_query(question: str, vector_store: PineconeVectorStore) -> str:
    """Query with retrieval-augmented generation."""
    
    @traceable(name="retrieve_documents")
    def retrieve(q: str):
        # This nested trace shows document retrieval as its own span
        return vector_store.similarity_search(q, k=4)
    
    # Get relevant documents
    docs = retrieve(question)
    
    # Stuff the retrieved documents into the prompt and generate an answer
    llm = ChatOpenAI(model="gpt-4")
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    
    chain = prompt | llm
    return chain.invoke({"context": context, "question": question}).content

# Usage
embeddings = OpenAIEmbeddings()
vectorstore = PineconeVectorStore(
    index_name="my-index",
    embedding=embeddings
)

result = rag_query("How do neural networks learn?", vectorstore)

The trace will show:

  • The documents retrieved and their contents
  • The prompt constructed from those documents
  • The LLM's response
  • Any intermediate reasoning

Core Concept 2: Testing and Evaluation

Tracing gives you visibility. Testing gives you confidence. LangSmith's testing framework lets you define datasets and run your application against them systematically.

Dataset-Driven Testing

The idea is simple: create a dataset of inputs and expected outputs, run your application against them, and measure performance.

python
from langsmith import Client

client = Client()

# Create a dataset
dataset_name = "customer_support_qa"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs for customer support chatbot"
)

# Add examples to the dataset
examples = [
    {
        "input": "How do I reset my password?",
        "expected_output": "You can reset your password by clicking 'Forgot Password' on the login page."
    },
    {
        "input": "What's your return policy?",
        "expected_output": "We offer 30-day returns on all items in original condition."
    },
    {
        "input": "How long does shipping take?",
        "expected_output": "Standard shipping takes 5-7 business days."
    },
]

for example in examples:
    client.create_example(
        dataset_id=dataset.id,
        inputs={"question": example["input"]},
        outputs={"answer": example["expected_output"]}
    )

Evaluators: Measuring Quality

For LLM outputs, we need custom evaluators because exact string matching won't work. LangSmith provides a framework for this:

python
from langsmith import evaluate
from langchain_openai import ChatOpenAI

def answer_relevance(run, example) -> dict:
    """Check if the output is relevant to the input."""
    # run.outputs contains the actual output
    # example.outputs contains expected output
    
    output = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")
    
    # Use an LLM to judge relevance
    llm = ChatOpenAI(model="gpt-4")
    prompt = f"""
    Question: {example.inputs['question']}
    Expected answer: {expected}
    Actual answer: {output}
    
    Is the actual answer relevant to the question? 
    Respond with 'yes' or 'no' only.
    """
    
    response = llm.invoke(prompt).content.lower().strip()
    return {"score": 1 if "yes" in response else 0}

def exact_match(run, example) -> dict:
    """Check for exact string match (strict)."""
    output = run.outputs.get("answer", "").lower().strip()
    expected = example.outputs.get("answer", "").lower().strip()
    return {"score": 1 if output == expected else 0}

def contains_keywords(run, example) -> dict:
    """Check if output contains key information."""
    output = run.outputs.get("answer", "").lower()
    expected = example.outputs.get("answer", "").lower()
    
    # Extract words from expected output
    keywords = set(expected.split())
    
    # Count how many keywords are in the actual output
    matches = sum(1 for keyword in keywords if keyword in output)
    score = matches / len(keywords) if keywords else 0
    
    return {"score": score}
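Before wiring an evaluator into an experiment, it helps to sanity-check it locally with mock objects that mimic the `run`/`example` attributes LangSmith passes in. The `contains_keywords` logic is repeated inline here so the snippet stands alone:

```python
from types import SimpleNamespace

def contains_keywords(run, example) -> dict:
    """Score by how many expected-answer words appear in the actual answer."""
    output = run.outputs.get("answer", "").lower()
    expected = example.outputs.get("answer", "").lower()
    keywords = set(expected.split())
    matches = sum(1 for keyword in keywords if keyword in output)
    return {"score": matches / len(keywords) if keywords else 0}

# Mock objects exposing the .outputs attribute the evaluator reads
run = SimpleNamespace(outputs={"answer": "Standard shipping takes 5-7 business days."})
example = SimpleNamespace(outputs={"answer": "shipping takes 5-7 days"})

print(contains_keywords(run, example))  # {'score': 1.0}
```

Cheap local checks like this catch scoring bugs before you spend money running an LLM-as-judge evaluator over a whole dataset.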

Running an Evaluation

Now we can run our application against the test dataset and evaluate results:

python
def support_chatbot(inputs: dict) -> dict:
    """Our chatbot implementation. evaluate() passes each example's inputs dict."""
    llm = ChatOpenAI(model="gpt-4", temperature=0)
    
    system_prompt = """You are a helpful customer support agent. 
    Answer questions about our products, policies, and services accurately 
    and concisely."""
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("user", "{question}")
    ])
    
    chain = prompt | llm
    response = chain.invoke({"question": inputs["question"]})
    
    return {"answer": response.content}

# Run evaluation
experiment_results = evaluate(
    support_chatbot,
    data=dataset_name,
    evaluators=[answer_relevance, exact_match, contains_keywords],
    experiment_prefix="chatbot_v1"
)

# Results are available in LangSmith dashboard
print(f"Experiment: {experiment_results}")

The dashboard shows:

  • Pass/fail rate for each evaluator
  • Detailed results for each test case
  • Failed examples for debugging
  • Comparison with previous runs

Core Concept 3: Monitoring and Feedback Loops

Production is where things get interesting (and scary). LangSmith provides monitoring capabilities to track your application's health in the wild.

Automatic Tracing in Production

Once deployed, your LangChain application automatically logs to LangSmith without code changes:

python
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Set environment variables
os.environ["LANGSMITH_API_KEY"] = "your-key"
os.environ["LANGSMITH_PROJECT"] = "production"
os.environ["LANGSMITH_TRACING"] = "true"  # Enable production tracing

# Your app runs normally, but everything is traced
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm

response = chain.invoke({"question": "What is AI?"})

Collecting User Feedback

The most valuable feedback comes from users. LangSmith lets you attach feedback to traces:

python
from langsmith import Client

client = Client()

# After generating a response, collect user feedback
def save_feedback(trace_id: str, user_rating: int, user_comment: str = ""):
    """Save user feedback to a trace."""
    client.create_feedback(
        run_id=trace_id,
        key="user_rating",
        score=user_rating,  # 1-5 stars
        comment=user_comment
    )

# In your application:
# 1. Get the trace ID (automatically available in LangChain)
# 2. Show feedback button to user
# 3. Save feedback when user clicks

# Example in a Flask app:
from flask import Flask, request
app = Flask(__name__)

@app.route("/feedback", methods=["POST"])
def feedback():
    trace_id = request.json["trace_id"]
    rating = request.json["rating"]
    comment = request.json.get("comment", "")
    
    save_feedback(trace_id, rating, comment)
    return {"status": "ok"}

Analyzing Production Patterns

With production traces and feedback, you can identify patterns:

python
from datetime import datetime, timedelta
from langsmith import Client

client = Client()

# Get all traces from the last 24 hours
runs = list(client.list_runs(
    project_name="production",
    start_time=datetime.now() - timedelta(days=1),
    limit=1000
))

# Analyze failure patterns
failures = [r for r in runs if r.status == "error"]

# Feedback lives on separate records, keyed by run ID
low_ratings = [
    f for f in client.list_feedback(run_ids=[r.id for r in runs])
    if f.score is not None and f.score <= 2
]

# Print insights
print(f"Total runs: {len(runs)}")
print(f"Errors: {len(failures)}")
print(f"Low user ratings: {len(low_ratings)}")

# Identify common failure patterns
error_types = {}
for run in failures:
    error = run.error or "unknown"
    error_types[error] = error_types.get(error, 0) + 1

print("\nMost common errors:")
for error, count in sorted(error_types.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {error}: {count}")

Advanced Pattern: Multi-Step Agent Tracing

LangSmith really shines with complex agents that make multiple LLM calls, use tools, and have branching logic. Here's a more realistic example:

python
from langchain.agents import Tool, initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langsmith import traceable

# Define tools for the agent
def search_api(query: str) -> str:
    """Search our knowledge base."""
    # Placeholder—would call real search API
    return f"Results for '{query}': [found relevant documents]"

def calculate_metric(metric_name: str, data: str) -> str:
    """Calculate a business metric."""
    # Placeholder—would call real calculation
    return f"Calculated {metric_name}: 42.5"

tools = [
    Tool(
        name="search",
        func=search_api,
        description="Search the knowledge base for information"
    ),
    Tool(
        name="calculate",
        func=calculate_metric,
        description="Calculate business metrics"
    ),
]

@traceable(name="intelligent_agent")
def run_agent(task: str) -> str:
    """An intelligent agent that can use tools and reason."""
    
    llm = ChatOpenAI(model="gpt-4", temperature=0)
    agent = initialize_agent(
        tools,
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True
    )
    return agent.run(task)

# Usage: the trace captures each reasoning step and tool call the agent makes
result = run_agent("Search the knowledge base and calculate the growth metric")
print(result)


Chalamaiah Chinnam

AI Engineer & Senior Software Engineer

15+ years of enterprise software experience, specializing in applied AI systems, multi-agent architectures, and RAG pipelines. Currently building AI-powered automation at LinkedIn.