Use CatchAll with LangChain for AI-powered web research
Build autonomous web search agents and research assistants that can find,
analyze, and synthesize information from millions of web pages using natural
language.
CatchAllClient wraps the
CatchAll Python SDK
with LangChain-friendly patterns. Use it for manual control in scripts, data
pipelines, and async applications.
# Kick off a research job; returns immediately with a job id.
job_id = client.submit_job(
    query="AI company acquisitions and mergers",
    context="Focus on deal size and technology sector",
    schema="[ACQUIRER] acquired [TARGET] for [AMOUNT]",
)
print(f"Job submitted: {job_id}")
Copy
Ask AI
job_id = await client.submit_job( query="AI company acquisitions and mergers", context="Focus on deal size and technology sector", schema="[ACQUIRER] acquired [TARGET] for [AMOUNT]",)print(f"Job submitted: {job_id}")
# Get first page
result = client.get_results(job_id, page=1, page_size=100)

# Get all pages
result = client.get_all_results(job_id)

for record in result.all_records:
    print(f"Title: {record.record_title}")
    print(f"Data: {record.enrichment}")
    print(f"Sources: {len(record.citations)} articles")
Copy
Ask AI
# Get first pageresult = await client.get_results(job_id, page=1, page_size=100)# Get all pagesresult = await client.get_all_results(job_id)for record in result.all_records: print(f"Title: {record.record_title}") print(f"Data: {record.enrichment}") print(f"Sources: {len(record.citations)} articles")
# Blocking search: waits for the job to finish and returns the result set.
result = client.search(
    query="Data breach incidents at financial institutions",
    context="Include incident type and affected customer count",
)
print(f"Found {result.valid_records} records")
Copy
Ask AI
result = await client.search( query="Data breach incidents at financial institutions", context="Include incident type and affected customer count",)print(f"Found {result.valid_records} records")
Set wait=False to return immediately without waiting:
Sync
Async
Copy
Ask AI
# wait=False returns right away; only the job id is populated.
result = client.search("FDA drug approvals for oncology treatments", wait=False)
print(f"Job ID: {result.job_id}")
# Retrieve later with client.get_all_results(result.job_id)
Copy
Ask AI
result = await client.search("FDA drug approvals for oncology treatments", wait=False)print(f"Job ID: {result.job_id}")# Retrieve later with await client.get_all_results(result.job_id)
Store job ID for later retrieval (useful for data pipelines):
Copy
Ask AI
import os

from langchain_catchall import CatchAllClient

client = CatchAllClient(api_key=os.environ["CATCHALL_API_KEY"])

# Submit and store job_id for later retrieval
job_id = client.submit_job("Technology company IPO filings")

# Store job_id (example using a dict - replace with your database)
job_cache = {}
job_cache["ipo_tracker"] = job_id

# Later: Check if completed and retrieve
status = client.get_status(job_id)
completed = any(s.status == 'completed' and s.completed for s in status.steps)
if completed:
    result = client.get_all_results(job_id)
    print(f"Retrieved {result.valid_records} records from cached job")
Reuse job results without re-running search:
Copy
Ask AI
import os

from langchain_catchall import CatchAllClient, query_with_llm
from langchain_openai import ChatOpenAI

client = CatchAllClient(api_key=os.environ["CATCHALL_API_KEY"])
llm = ChatOpenAI(model="gpt-4o")

# Search once
result = client.search("Enterprise software company earnings reports")

# Query many times (no additional API cost)
answer1 = query_with_llm(result, "Which companies reported highest revenue?", llm)
answer2 = query_with_llm(result, "Compare year-over-year growth rates", llm)
answer3 = query_with_llm(result, "What are the key trends?", llm)
Combine search with LLM analysis:
Copy
Ask AI
import os

from langchain_catchall import CatchAllClient, query_with_llm
from langchain_openai import ChatOpenAI

client = CatchAllClient(api_key=os.environ["CATCHALL_API_KEY"])
llm = ChatOpenAI(model="gpt-4o")

# Submit job
job_id = client.submit_job("AI startup funding rounds over $10M")
client.wait_for_completion(job_id)
result = client.get_all_results(job_id)

# Analyze with LLM
answer = query_with_llm(
    result=result,
    question="Summarize top 5 deals by funding amount",
    llm=llm,
    max_records=100,  # Limit context size for faster analysis
)
print(answer)
Show Complete example: Async web scraping pipeline
Copy
Ask AI
import asyncio
import os

from langchain_catchall import AsyncCatchAllClient


async def process_multiple_queries():
    """Submit multiple searches concurrently."""
    client = AsyncCatchAllClient(api_key=os.environ["CATCHALL_API_KEY"])

    queries = [
        "Technology company acquisitions and mergers",
        "Healthcare and biotech company IPO filings",
        "Retail company bankruptcy filings and restructuring",
    ]

    try:
        # Submit all jobs concurrently
        job_ids = await asyncio.gather(*[
            client.submit_job(query) for query in queries
        ])
        print(f"Submitted {len(job_ids)} jobs")

        # Wait for all completions
        await asyncio.gather(*[
            client.wait_for_completion(job_id) for job_id in job_ids
        ])

        # Retrieve all results
        results = await asyncio.gather(*[
            client.get_all_results(job_id) for job_id in job_ids
        ])

        # Process results
        for query, result in zip(queries, results):
            print(f"\n{query}: {result.valid_records} records")

    except TimeoutError as e:
        print(f"One or more jobs timed out: {e}")
    except Exception as e:
        print(f"Error processing queries: {e}")
        raise


if __name__ == "__main__":
    asyncio.run(process_multiple_queries())
CatchAllTools provides ready-to-use tools for LangGraph agents with built-in
caching. Search once, then analyze many times without additional API costs.
Build an autonomous research agent with LangGraph:
Copy
Ask AI
import os

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langchain.messages import SystemMessage
from langchain_catchall import CatchAllTools, CATCHALL_AGENT_PROMPT

# Initialize components
llm = ChatOpenAI(model="gpt-4o")
toolkit = CatchAllTools(
    api_key=os.environ["CATCHALL_API_KEY"],
    llm=llm,
    verbose=True
)
tools = toolkit.get_tools()

# Create agent with prompt
agent = create_react_agent(model=llm, tools=tools)
messages = [SystemMessage(content=CATCHALL_AGENT_PROMPT)]

# Run agent
response = agent.invoke({
    "messages": messages + [
        ("user", "Find technology company acquisitions announced this week")
    ]
})
print(response["messages"][-1].content)
Show CATCHALL_AGENT_PROMPT content
Copy
Ask AI
# System prompt for the CatchAll research agent. Line breaks were collapsed by
# extraction; reconstructed here from the prompt's own numbered structure —
# TODO confirm exact whitespace against the published package source.
CATCHALL_AGENT_PROMPT = """You are a News Research Assistant powered by CatchAll.

Your workflow is strictly defined:

1. SEARCH: Use `catchall_search_data` to get a broad initial dataset (e.g., 'Find all US office openings').
   - WARNING: This tool takes 15 minutes. NEVER call it twice in a row.
   - After searching, STOP and return what you found. WAIT for the user's next question.
   - DO NOT automatically analyze or summarize unless explicitly asked.

2. ANALYZE: Use `catchall_analyze_data` ONLY when the user asks a follow-up question.
   - FILTERING & SORTING: 'Show me only Florida deals', 'Sort by date', 'Find top 3'.
   - AGGREGATION: 'Group by state', 'Count by industry'.
   - QA: 'What are the main trends?', 'Summarize key findings'.

CRITICAL RULES:
- After a search completes, report the number of results found and STOP. Wait for user input.
- ONLY call analyze_data when the user explicitly asks a follow-up question.
- If user says "Find X", just search and report results. If they say "Summarize Y" or "Show me Z", then analyze.
- Never use `catchall_search_data` to filter. Always use `catchall_analyze_data` for filtering.
- If the user asks for a subset of data (like 'only Florida deals'), assume it is ALREADY in your search results.
- Only use `catchall_search_data` if the user explicitly asks for a 'new search' or a completely different topic."""
Show Complete example: Interactive research session
Copy
Ask AI
import os

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langchain.messages import SystemMessage
from langchain_catchall import CatchAllTools, CATCHALL_AGENT_PROMPT


def run_interactive_agent():
    """Run an interactive research agent with conversation history."""
    try:
        # Setup: fail fast with a clear message when the key is missing.
        api_key = os.environ.get("CATCHALL_API_KEY")
        if not api_key:
            raise ValueError("CATCHALL_API_KEY environment variable not set")

        llm = ChatOpenAI(model="gpt-4o", temperature=0)
        toolkit = CatchAllTools(api_key=api_key, llm=llm, verbose=True)
        tools = toolkit.get_tools()

        # Create agent seeded with the CatchAll system prompt.
        agent = create_react_agent(model=llm, tools=tools)
        messages = [SystemMessage(content=CATCHALL_AGENT_PROMPT)]

        print("Research Agent Ready!")
        print("Type 'quit' to exit\n")

        while True:
            try:
                # Get user input
                user_input = input("You: ").strip()
                if user_input.lower() == 'quit':
                    break
                if not user_input:
                    continue

                # Add user message to the running history.
                messages.append(("user", user_input))

                # Get agent response
                response = agent.invoke({"messages": messages})
                assistant_message = response["messages"][-1].content

                # Add assistant message to history
                messages.append(("assistant", assistant_message))

                # Display response
                print(f"\nAgent: {assistant_message}\n")

            except KeyboardInterrupt:
                print("\nExiting...")
                break
            except Exception as e:
                # Per-turn errors are reported but do not end the session.
                print(f"\nError: {e}")
                print("Continuing...\n")

    except Exception as e:
        print(f"Failed to initialize agent: {e}")
        raise


if __name__ == "__main__":
    run_interactive_agent()
Example session:
Copy
Ask AI
You: Find venture capital funding rounds for biotech startups
Agent: I'll search for biotech venture funding articles... [15 minutes later]
Found 47 records. Here are the top deals:
1. BioTech Corp raised $25M Series B
2. GeneTech raised $15M Series A
...
You: Show only deals over $20M
Agent: [Instantly] Based on the cached results, here are deals over $20M:
1. BioTech Corp - $25M Series B
2. MedTech Inc - $30M Series C
...
You: What's the average funding amount?
Agent: [Instantly] Analyzing the data... The average funding amount is $18.5M across all 47 deals.
import os

from langchain_catchall import CatchAllClient

# max_wait_time is in seconds: 2400 s = 40 minutes.
client = CatchAllClient(
    api_key=os.environ["CATCHALL_API_KEY"],
    max_wait_time=2400  # 40 minutes
)

try:
    result = client.search("Venture capital funding rounds across all industries")
    print(f"Success: {result.valid_records} records")
except TimeoutError as e:
    # Message now matches max_wait_time=2400 (was incorrectly "30 minutes").
    print(f"Search timed out after 40 minutes: {e}")
    # Retry with narrower query
    result = client.search("Series B funding rounds for fintech startups")
except Exception as e:
    print(f"Unexpected error: {e}")
    raise
Monitors automate recurring CatchAll searches with scheduled execution. The
langchain-catchall package does not support Monitors. To use Monitors, install the underlying SDK: