LangSmith: Your Complete Guide to Debugging, Testing, and Evaluating LLM Applications
Introduction
Building production-grade LLM applications is fundamentally different from training models in research settings. When you deploy a language model into the wild—whether it's powering a customer support chatbot, a code generation tool, or a multi-step reasoning agent—you encounter a new class of problems: visibility, reproducibility, and evaluation.
Imagine you've built an AI system that handles customer inquiries. A customer reports that the system gave a nonsensical response, but you can't reproduce the issue. Why? Because the LLM's behavior is non-deterministic, the chain of prompts is complex, and you have no visibility into intermediate reasoning steps. This is where LangSmith enters the picture.
LangSmith is a comprehensive observability and debugging platform designed specifically for LLM applications. It's the missing operational layer that transforms your experimental notebooks into reliable, maintainable systems. In this article, we'll explore what LangSmith does, why it's essential for production LLM work, and how to integrate it into your workflow with practical examples.
The Problem: Why Traditional Debugging Doesn't Work for LLM Applications
Before diving into LangSmith, let's understand the unique challenges of debugging LLM-powered systems.
Non-Deterministic Behavior
Traditional software follows deterministic execution paths. Given the same input, a function returns the same output. Not so with LLMs. The same prompt might generate different outputs due to temperature settings, model updates, or sampling variations. This makes reproducing bugs incredibly difficult.
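The effect of temperature on sampling can be illustrated without calling a model at all. The sketch below simulates temperature-scaled softmax sampling over a toy set of candidate tokens (the logit values are invented for illustration):

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Sample one token from a logit distribution at a given temperature."""
    if temperature == 0:
        # Greedy decoding: always pick the highest-scoring token (deterministic).
        return max(logits, key=logits.get)
    # Softmax with temperature: lower T sharpens, higher T flattens the distribution.
    scaled = {tok: math.exp(logit / temperature) for tok, logit in logits.items()}
    total = sum(scaled.values())
    probs = {tok: v / total for tok, v in scaled.items()}
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

logits = {"Paris": 3.0, "London": 1.5, "Berlin": 0.5}
rng = random.Random(0)

# At temperature 0 the output never varies; at temperature 1 it can.
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
print(greedy)   # exactly one token
print(sampled)  # typically several tokens
```

The same mechanism operates inside a real LLM call: raise the temperature and identical prompts start producing different completions, which is exactly what makes bug reproduction hard.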
Invisible Complexity
A typical LLM application involves multiple layers of abstraction:
- Prompt templates with variable interpolation
- Chain-of-thought reasoning spanning multiple LLM calls
- Retrieval-augmented generation (RAG) with database lookups
- Tool calls and function execution
- Output parsing and validation
Each layer can fail silently, making root cause analysis a nightmare.
Evaluation Challenges
How do you measure if your LLM application is "working correctly"? Unlike traditional software with clear pass/fail test cases, LLM outputs require semantic evaluation. Is the response helpful? Is it factually accurate? Is it aligned with the user's intent?
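To see why the usual pass/fail mindset breaks down, compare exact string matching against even a crude lexical-overlap score. Both functions below are illustrative stand-ins, not LangSmith APIs:

```python
def exact_match(output: str, expected: str) -> bool:
    """Strict comparison: fails on any wording difference."""
    return output.strip().lower() == expected.strip().lower()

def token_overlap(output: str, expected: str) -> float:
    """Jaccard similarity over word sets: a crude proxy for semantic closeness."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

expected = "You can reset your password on the login page"
answer = "Reset your password from the login page"

print(exact_match(answer, expected))              # False, though the answer is fine
print(round(token_overlap(answer, expected), 2))  # 0.6
```

A perfectly helpful answer scores zero under exact match but high under overlap; neither captures factual accuracy or intent alignment, which is why LLM applications need richer evaluators.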
Production Monitoring Blind Spots
Once deployed, how do you know if your system is degrading? What if the model provider releases a new version with different behavior? Without systematic logging and tracing, you're flying blind.
LangSmith addresses all these challenges by providing a structured approach to observation, debugging, and evaluation.
What Is LangSmith?
LangSmith is a platform for instrumenting, debugging, and evaluating LangChain and LangGraph applications (though it can work with any LLM framework). It operates on three core pillars:
- Tracing: Automatically capture every step of your LLM application
- Testing & Evaluation: Run systematic tests against your chains and compare performance metrics
- Monitoring: Track production behavior and catch regressions
Think of LangSmith as the APM (Application Performance Monitoring) tool for LLM applications—similar to how Datadog or New Relic work for traditional web services, but purpose-built for the unique characteristics of LLM systems.
Core Concept 1: Tracing and Observability
What Gets Traced?
LangSmith automatically captures:
- LLM Calls: Every prompt sent to an LLM and every completion returned
- Retriever Calls: Vector database lookups and document retrievals
- Tool Usage: Function calls, API invocations, code execution
- Parsing Operations: Token parsing, output validation
- Chain Execution: Branching logic, loops, conditional flows
- Timing Data: Latency at each step
- Token Usage: Input/output tokens for cost tracking
This creates a complete execution trace—a visual tree of everything your application did.
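As a quick illustration of cost tracking, the token counts on each traced LLM span can be turned into a dollar estimate. The per-million-token prices below are placeholder values, so substitute your provider's current rates:

```python
# Hypothetical per-million-token prices -- check your provider's current pricing.
PRICES = {"gpt-4": {"input": 30.00, "output": 60.00}}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one LLM call from its traced token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Token counts like these appear on every LLM span in a trace.
cost = estimate_cost("gpt-4", input_tokens=1_200, output_tokens=300)
print(f"${cost:.4f}")  # $0.0540
```

Summing this over every span in a trace gives the cost of one request; summing over a project gives your burn rate.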
Why This Matters
Consider a RAG (Retrieval-Augmented Generation) pipeline that retrieves documents and generates an answer. With tracing, you can answer questions like:
- Which documents were retrieved? Are they relevant?
- What prompt was actually sent to the LLM?
- Where did the latency come from? Retrieval or generation?
- Did the LLM use the retrieved context effectively?
- How many tokens were used at each step?
Without tracing, you're guessing. With it, you have data.
Simple Tracing Example
Let's see how to instrument a basic LangChain application with LangSmith:
```python
import os
from langsmith import traceable
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Set your LangSmith API key and enable tracing
os.environ["LANGSMITH_API_KEY"] = "your-api-key-here"
os.environ["LANGSMITH_PROJECT"] = "my-project"
os.environ["LANGSMITH_TRACING"] = "true"

# Create a traced function
@traceable(name="simple_query")
def answer_question(question: str) -> str:
    """Answer a question using GPT-4."""
    llm = ChatOpenAI(model="gpt-4", temperature=0)
    prompt = ChatPromptTemplate.from_template(
        "Answer the following question concisely:\n{question}"
    )
    chain = prompt | llm
    response = chain.invoke({"question": question})
    return response.content

# Call it—LangSmith will automatically trace the execution
result = answer_question("What is the capital of France?")
print(result)
```
When you run this code, LangSmith captures:
- The function entry/exit
- The LLM call with the exact prompt
- Token counts and latency
- The response
You can then view this trace in the LangSmith dashboard.
RAG Pipeline with Tracing
Here's a more complex example with retrieval:
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA
from langsmith import traceable

@traceable(name="rag_pipeline")
def rag_query(question: str, vector_store: PineconeVectorStore) -> str:
    """Query with retrieval-augmented generation."""

    @traceable(name="retrieve_documents")
    def retrieve(q: str):
        # This nested trace shows document retrieval as its own span
        docs = vector_store.similarity_search(q, k=4)
        return docs

    # Inspect the relevant documents (traced separately)
    docs = retrieve(question)

    # Create the QA chain
    llm = ChatOpenAI(model="gpt-4")
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(search_kwargs={"k": 4})
    )
    answer = qa_chain.invoke({"query": question})["result"]
    return answer

# Usage
embeddings = OpenAIEmbeddings()
vectorstore = PineconeVectorStore(
    index_name="my-index",
    embedding=embeddings
)
result = rag_query("How do neural networks learn?", vectorstore)
```
The trace will show:
- Documents retrieved and their relevance scores
- The prompt constructed from those documents
- The LLM's response
- Any intermediate reasoning
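Once spans like these are captured, latency questions become simple arithmetic. The snippet below walks a simplified, hypothetical trace tree (not LangSmith's actual run schema) and reports each child span's share of the total:

```python
# A simplified, hypothetical trace tree -- not LangSmith's actual run schema.
trace = {
    "name": "rag_pipeline", "latency_ms": 2400,
    "children": [
        {"name": "retrieve_documents", "latency_ms": 350, "children": []},
        {"name": "llm_generation", "latency_ms": 1900, "children": []},
    ],
}

def latency_breakdown(span: dict) -> dict[str, float]:
    """Report each child span's percentage share of the parent's total latency."""
    total = span["latency_ms"]
    return {c["name"]: round(100 * c["latency_ms"] / total, 1) for c in span["children"]}

print(latency_breakdown(trace))
# {'retrieve_documents': 14.6, 'llm_generation': 79.2} -- generation dominates here
```

A breakdown like this immediately answers "retrieval or generation?" for the latency question raised earlier.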
Core Concept 2: Testing and Evaluation
Tracing gives you visibility. Testing gives you confidence. LangSmith's testing framework lets you define datasets and run your application against them systematically.
Dataset-Driven Testing
The idea is simple: create a dataset of inputs and expected outputs, run your application against them, and measure performance.
```python
from langsmith import Client

client = Client()

# Create a dataset
dataset_name = "customer_support_qa"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs for customer support chatbot"
)

# Add examples to the dataset
examples = [
    {
        "input": "How do I reset my password?",
        "expected_output": "You can reset your password by clicking 'Forgot Password' on the login page."
    },
    {
        "input": "What's your return policy?",
        "expected_output": "We offer 30-day returns on all items in original condition."
    },
    {
        "input": "How long does shipping take?",
        "expected_output": "Standard shipping takes 5-7 business days."
    },
]

for example in examples:
    client.create_example(
        dataset_id=dataset.id,
        inputs={"question": example["input"]},
        outputs={"answer": example["expected_output"]}
    )
```
Evaluators: Measuring Quality
For LLM outputs, we need custom evaluators because exact string matching won't work. LangSmith provides a framework for this:
```python
from langsmith import evaluate
from langchain_openai import ChatOpenAI

def answer_relevance(run, example) -> dict:
    """Check if the output is relevant to the input."""
    # run.outputs contains the actual output
    # example.outputs contains the expected output
    output = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")

    # Use an LLM to judge relevance
    llm = ChatOpenAI(model="gpt-4")
    prompt = f"""
    Question: {example.inputs['question']}
    Expected answer: {expected}
    Actual answer: {output}

    Is the actual answer relevant to the question?
    Respond with 'yes' or 'no' only.
    """
    response = llm.invoke(prompt).content.lower().strip()
    return {"score": 1 if "yes" in response else 0}

def exact_match(run, example) -> dict:
    """Check for exact string match (strict)."""
    output = run.outputs.get("answer", "").lower().strip()
    expected = example.outputs.get("answer", "").lower().strip()
    return {"score": 1 if output == expected else 0}

def contains_keywords(run, example) -> dict:
    """Check if the output contains key information."""
    output = run.outputs.get("answer", "").lower()
    expected = example.outputs.get("answer", "").lower()

    # Extract words from the expected output
    keywords = set(expected.split())

    # Count how many keywords appear in the actual output
    matches = sum(1 for keyword in keywords if keyword in output)
    score = matches / len(keywords) if keywords else 0
    return {"score": score}
```
Running an Evaluation
Now we can run our application against the test dataset and evaluate results:
```python
from langsmith import evaluate
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def support_chatbot(inputs: dict) -> dict:
    """Our chatbot implementation. evaluate() passes each example's inputs dict."""
    llm = ChatOpenAI(model="gpt-4", temperature=0)
    system_prompt = """You are a helpful customer support agent.
    Answer questions about our products, policies, and services
    accurately and concisely."""
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("user", "{question}")
    ])
    chain = prompt | llm
    response = chain.invoke({"question": inputs["question"]})
    return {"answer": response.content}

# Run evaluation
experiment_results = evaluate(
    support_chatbot,
    data=dataset_name,
    evaluators=[answer_relevance, exact_match, contains_keywords],
    experiment_prefix="chatbot_v1"
)

# Results are available in the LangSmith dashboard
print(f"Experiment: {experiment_results}")
```
The dashboard shows:
- Pass/fail rate for each evaluator
- Detailed results for each test case
- Failed examples for debugging
- Comparison with previous runs
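Comparisons like these can also be reasoned about offline. Assuming you have exported per-example evaluator scores from two experiments (the numbers below are invented for illustration), a regression check reduces to comparing means:

```python
# Hypothetical per-example evaluator scores exported from two experiment runs.
chatbot_v1 = {"answer_relevance": [1, 1, 0], "contains_keywords": [0.8, 0.5, 0.4]}
chatbot_v2 = {"answer_relevance": [1, 1, 1], "contains_keywords": [0.9, 0.7, 0.6]}

def mean_scores(results: dict[str, list[float]]) -> dict[str, float]:
    """Average each evaluator's per-example scores."""
    return {k: round(sum(v) / len(v), 2) for k, v in results.items()}

def regressions(old: dict, new: dict) -> list[str]:
    """Flag evaluators whose mean score dropped between runs."""
    old_m, new_m = mean_scores(old), mean_scores(new)
    return [k for k in old_m if new_m[k] < old_m[k]]

print(mean_scores(chatbot_v1))
print(mean_scores(chatbot_v2))
print(regressions(chatbot_v1, chatbot_v2))  # [] -- no regressions in this toy data
```

Gating a deploy on an empty regression list is a simple way to turn evaluation runs into a CI check.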
Core Concept 3: Monitoring and Feedback Loops
Production is where things get interesting (and scary). LangSmith provides monitoring capabilities to track your application's health in the wild.
Automatic Tracing in Production
Once deployed, your LangChain application automatically logs to LangSmith without code changes:
```python
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Set environment variables
os.environ["LANGSMITH_API_KEY"] = "your-key"
os.environ["LANGSMITH_PROJECT"] = "production"
os.environ["LANGSMITH_TRACING"] = "true"  # Enable production tracing

# Your app runs normally, but everything is traced
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm
response = chain.invoke({"question": "What is AI?"})
```
Collecting User Feedback
The most valuable feedback comes from users. LangSmith lets you attach feedback to traces:
```python
from flask import Flask, request
from langsmith import Client

client = Client()

# After generating a response, collect user feedback
def save_feedback(trace_id: str, user_rating: int, user_comment: str = ""):
    """Save user feedback to a trace."""
    client.create_feedback(
        run_id=trace_id,
        key="user_rating",
        score=user_rating,  # 1-5 stars
        comment=user_comment
    )

# In your application:
# 1. Get the trace ID (automatically available in LangChain)
# 2. Show a feedback button to the user
# 3. Save feedback when the user clicks

# Example in a Flask app:
app = Flask(__name__)

@app.route("/feedback", methods=["POST"])
def feedback():
    trace_id = request.json["trace_id"]
    rating = request.json["rating"]
    comment = request.json.get("comment", "")
    save_feedback(trace_id, rating, comment)
    return {"status": "ok"}
```
Analyzing Production Patterns
With production traces and feedback, you can identify patterns:
```python
from langsmith import Client

client = Client()

# Get recent traces (materialize the generator so we can iterate more than once)
runs = list(client.list_runs(
    project_name="production",
    filter='gt(start_time, "2024-01-15")',
    limit=1000
))

# Analyze failure patterns
failures = [r for r in runs if r.status == "error"]

# feedback_stats aggregates attached feedback per key, e.g. {"user_rating": {"avg": 4.2, ...}}
low_ratings = [
    r for r in runs
    if (r.feedback_stats or {}).get("user_rating", {}).get("avg", 5) <= 2
]

# Print insights
print(f"Total runs: {len(runs)}")
print(f"Errors: {len(failures)}")
print(f"Low user ratings: {len(low_ratings)}")

# Identify common failure patterns
error_types = {}
for run in failures:
    error = run.error or "unknown"
    error_types[error] = error_types.get(error, 0) + 1

print("\nMost common errors:")
for error, count in sorted(error_types.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {error}: {count}")
```
Advanced Pattern: Multi-Step Agent Tracing
LangSmith really shines with complex agents that make multiple LLM calls, use tools, and have branching logic. Here's a more realistic example:
```python
from langchain.agents import Tool, initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langsmith import traceable

# Define tools for the agent
def search_api(query: str) -> str:
    """Search our knowledge base."""
    # Placeholder—would call real search API
    return f"Results for '{query}': [found relevant documents]"

def calculate_metric(metric_name: str) -> str:
    """Calculate a business metric."""
    # Placeholder—would call real calculation
    return f"Calculated {metric_name}: 42.5"

tools = [
    Tool(
        name="search",
        func=search_api,
        description="Search the knowledge base for information"
    ),
    Tool(
        name="calculate",
        func=calculate_metric,
        description="Calculate business metrics"
    ),
]

@traceable(name="intelligent_agent")
def run_agent(task: str) -> str:
    """An intelligent agent that can use tools and reason."""
    llm = ChatOpenAI(model="gpt-4", temperature=0)
    agent = initialize_agent(
        tools,
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True
    )
    return agent.run(task)

# Every reasoning step, tool call, and LLM call appears as its own span in the trace
result = run_agent("Search for our Q3 revenue data and calculate the growth metric")
```