Retrieve large datasets

News API returns at most 10,000 articles per query. Broad queries hit this cap routinely: a search for “artificial intelligence” in English returns 10,000 results even when hundreds of thousands of matching articles exist. This guide walks through the full retrieval workflow in three steps: measure your dataset volume, choose a chunk size that stays under the cap, then fetch everything. All three steps include code examples for Python, TypeScript, and Java.

Before you begin

An active News API key
For Python: Python 3.10+ with News API Python SDK installed

Retrieval workflow

Measure volume with aggregation

Before writing any retrieval logic, use /aggregation_count to understand how many articles your query actually matches and how they’re distributed over time. This tells you which chunk size to use and whether your query needs narrowing.

curl -X POST "https://v3-api.newscatcherapi.com/api/aggregation_count" \
  -H "x-api-token: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "q": "artificial intelligence",
    "lang": "en",
    "aggregation_by": "day"
  }'

The response shows total volume and the per-day distribution:

{
    "status": "ok",
    "total_hits": 118562,
    "page": 1,
    "total_pages": 1186,
    "page_size": 100,
    "aggregations": [
        {
            "aggregation_count": [
                { "time_frame": "2026-05-04 00:00:00", "article_count": 18461 },
                { "time_frame": "2026-05-05 00:00:00", "article_count": 20725 },
                { "time_frame": "2026-05-06 00:00:00", "article_count": 20880 },
                { "time_frame": "2026-05-07 00:00:00", "article_count": 20973 },
                { "time_frame": "2026-05-08 00:00:00", "article_count": 15915 },
                { "time_frame": "2026-05-09 00:00:00", "article_count": 6708 },
                { "time_frame": "2026-05-10 00:00:00", "article_count": 5782 },
                { "time_frame": "2026-05-11 00:00:00", "article_count": 9118 }
            ]
        }
    ]
}

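The query matches 118,562 articles in total. The five busiest days range from 15,915 to 20,973 articles, so daily chunks would hit the 10,000 cap on five of the eight days shown; only the quieter days (5,782 to 9,118 articles) fit in a single window. Six-hour chunks average about 5,200 articles on the busiest day, which leaves headroom under the cap, so use 6-hour chunks.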
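The same reading of the response can be done in code. This is a minimal sketch that assumes the response has already been parsed into a dict with the shape shown above; `summarize_volume` and `CAP` are illustrative names, not part of the SDK.

```python
CAP = 10_000  # per-query article limit described in this guide


def summarize_volume(response: dict, cap: int = CAP) -> dict:
    """Summarize per-period counts from a parsed /aggregation_count response."""
    buckets = response["aggregations"][0]["aggregation_count"]
    counts = {b["time_frame"]: b["article_count"] for b in buckets}
    # Periods whose count exceeds the cap cannot be fetched in a single window.
    over_cap = [frame for frame, n in counts.items() if n > cap]
    return {
        "total_hits": response["total_hits"],
        "peak": max(counts.values()),
        "periods": len(counts),
        "over_cap": over_cap,
    }


# Sample data copied from the response above.
sample = {
    "total_hits": 118562,
    "aggregations": [{"aggregation_count": [
        {"time_frame": "2026-05-04 00:00:00", "article_count": 18461},
        {"time_frame": "2026-05-05 00:00:00", "article_count": 20725},
        {"time_frame": "2026-05-06 00:00:00", "article_count": 20880},
        {"time_frame": "2026-05-07 00:00:00", "article_count": 20973},
        {"time_frame": "2026-05-08 00:00:00", "article_count": 15915},
        {"time_frame": "2026-05-09 00:00:00", "article_count": 6708},
        {"time_frame": "2026-05-10 00:00:00", "article_count": 5782},
        {"time_frame": "2026-05-11 00:00:00", "article_count": 9118},
    ]}],
}

summary = summarize_volume(sample)
print(summary["peak"], len(summary["over_cap"]))  # prints: 20973 5
```

The peak count, rather than the average, is the number that matters when picking a chunk size, because a single over-cap window is enough to lose articles.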

Choose the right chunk size

Pick a chunk size where each window returns fewer than 10,000 articles. Use the per-period counts from Step 1:

Articles per period	Recommended chunk size
More than 10,000 per hour	`"1h"` — consider narrowing the query
More than 10,000 per day	`"6h"` or `"1h"`
3,000–10,000 per day	`"1d"`
1,000–3,000 per day	`"3d"`
100–1,000 per day	`"7d"`
Fewer than 100 per day	`"30d"`

For the “artificial intelligence” example above: the busiest days have roughly 16,000 to 21,000 articles, which is over the cap for daily windows but well under it when split into four, so "6h" is the right choice.
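The table can be expressed as a small lookup. The sketch below is one possible encoding, not an SDK function: `recommend_chunk_size` is a hypothetical helper, the handling of exact boundary values (for example, exactly 3,000 per day) is an assumption, and the choice between "6h" and "1h" assumes articles are spread roughly evenly across the day.

```python
CAP = 10_000  # per-window article limit described in this guide


def recommend_chunk_size(peak_per_day: int, cap: int = CAP) -> str:
    """Map the busiest day's article count to a chunk size from the table above."""
    if peak_per_day > cap:
        # Assumption: a 6-hour window holds about a quarter of the daily count.
        return "6h" if peak_per_day / 4 < cap else "1h"
    if peak_per_day >= 3_000:
        return "1d"
    if peak_per_day >= 1_000:
        return "3d"
    if peak_per_day >= 100:
        return "7d"
    return "30d"


print(recommend_chunk_size(20_973))  # prints: 6h
```

Publishing volume is rarely uniform over a day, so if a "6h" window still returns close to 10,000 articles, step down to "1h" or narrow the query.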

Retrieve all articles

With chunk size confirmed, iterate through your date range window by window, paginating each window fully before moving to the next.

import time
from datetime import datetime, timedelta

from newscatcher import NewscatcherApi
from newscatcher.core import ApiError

client = NewscatcherApi(api_key="YOUR_API_KEY")

def get_all_articles(query: str, from_date: str, to_date: str, chunk_hours: int) -> list:
    """Fetch articles between from_date and to_date, one time window at a time."""
    articles = []
    window_start = datetime.fromisoformat(from_date)
    end = datetime.fromisoformat(to_date)

    while window_start < end:
        # Clamp the last window so it never runs past the requested end.
        window_end = min(window_start + timedelta(hours=chunk_hours), end)

        page = 1
        total_pages = 1  # Placeholder until the first response reports the real count.

        while page <= total_pages:
            try:
                response = client.search.post(
                    q=query,
                    lang="en",
                    from_=window_start.isoformat(),
                    to=window_end.isoformat(),
                    page=page,
                    page_size=1000,
                )
                articles.extend(response.articles)
                total_pages = response.total_pages
                page += 1
                time.sleep(0.5)  # Pause between requests to stay under rate limits.
            except ApiError as e:
                # The remaining pages of this window are skipped, so log the
                # window as well as the page to make a targeted re-run possible.
                print(f"Error in window {window_start.isoformat()} on page {page}: {e}")
                break

        window_start = window_end

    return articles

articles = get_all_articles(
    "artificial intelligence",
    "2026-05-04T00:00:00",
    "2026-05-11T00:00:00",
    chunk_hours=6,
)
print(f"Retrieved {len(articles)} articles")

Python SDK: automated retrieval

The Python SDK provides get_all_articles and get_all_headlines, two methods that automate this workflow. They split your date range into chunks, paginate each chunk, deduplicate results, and return a combined list. You can still measure volume with /aggregation_count to choose a suitable time_chunk_size, but you don’t need to write the iteration logic.
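If you run the manual loop instead, deduplication is the one step you have to add yourself. The sketch below shows one way to do it, assuming each article is a dict carrying a unique "id" field; that field name and the `dedupe_articles` helper are assumptions for illustration, and the SDK's own deduplication logic may differ.

```python
def dedupe_articles(articles: list[dict], key: str = "id") -> list[dict]:
    """Return articles with repeated keys removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for article in articles:
        article_id = article.get(key)
        if article_id is None:
            # Without an identifier there is nothing to compare, so keep the article.
            unique.append(article)
        elif article_id not in seen:
            seen.add(article_id)
            unique.append(article)
    return unique


batch = [{"id": "a", "title": "First"}, {"id": "b", "title": "Second"}, {"id": "a", "title": "First"}]
print([a["id"] for a in dedupe_articles(batch)])  # prints: ['a', 'b']
```

Duplicates are most likely at window boundaries, where the end of one window equals the start of the next.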

How time-chunking works

Time-chunking divides your date range into smaller intervals, queries each interval separately, and combines the results. Each interval can return up to 10,000 articles. For example, with time_chunk_size="1d" over 5 days, the method queries five windows, one per day, paginating each automatically, and can retrieve up to 50,000 articles in total.

Figure: Time-chunking splits a date range into five daily API requests, each returning up to 10,000 articles, combined into a single deduplicated result set of up to 50,000 articles.
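The splitting step on its own can be sketched with the standard library. `iter_windows` is an illustrative helper, not an SDK function, and it covers only the date arithmetic, not the API calls.

```python
from datetime import datetime, timedelta


def iter_windows(start: datetime, end: datetime, chunk: timedelta):
    """Yield consecutive (window_start, window_end) pairs covering start to end."""
    if chunk <= timedelta(0):
        raise ValueError("chunk must be positive")
    window_start = start
    while window_start < end:
        # The final window is clamped so it never extends past the end.
        window_end = min(window_start + chunk, end)
        yield window_start, window_end
        window_start = window_end


windows = list(iter_windows(datetime(2026, 5, 4), datetime(2026, 5, 9), timedelta(days=1)))
print(len(windows), len(windows) * 10_000)  # prints: 5 50000
```

Five daily windows at up to 10,000 articles each give the 50,000-article ceiling from the example above.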

get_all_articles

Retrieves all articles matching a search query over a date range. Accepts all standard /search endpoint parameters via **kwargs — lang, countries, sort_by, include_nlp_data, and so on.

from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key="YOUR_API_KEY")

articles = client.get_all_articles(
    q="artificial intelligence",
    lang="en",
    from_="7d",
    to="now",
    time_chunk_size="6h",
    max_articles=50000,
    show_progress=True,
)

print(f"Retrieved {len(articles)} articles")

get_all_headlines

Retrieves all latest headlines over a time range. Accepts all standard /latest_headlines endpoint parameters via **kwargs.

from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key="YOUR_API_KEY")

headlines = client.get_all_headlines(
    when="7d",
    time_chunk_size="1d",
    max_articles=20000,
    show_progress=True,
)

print(f"Retrieved {len(headlines)} headlines")

SDK method parameters

time_chunk_size (string, default: "1h")
Size of each time window. Accepted values: "1h", "6h", "1d", "7d", "1m".

max_articles (integer, default: 100000)
Maximum total articles to retrieve across all chunks.

show_progress (boolean, default: false)
Display a progress bar during retrieval.

deduplicate (boolean, default: true)
Remove duplicate articles from the combined results.

validate_query (boolean, default: true)
get_all_articles only. Validates query syntax locally before making any API calls. Set to false to skip.

concurrency (integer, default: 3)
AsyncNewscatcherApi only. Number of concurrent page requests within each time chunk.

Both methods accept all other endpoint parameters via **kwargs and pass them to the API. For example, you can filter by language, sort by relevance, or include NLP data in results from either method — just as you would with direct API calls to /search or /latest_headlines.

Common issues

Rate limiting errors (429)

For async Python, reduce concurrency. For manual iteration, add delays between window requests. If limits are hit consistently, consider narrowing your query to reduce overall volume.
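For manual iteration, one common way to add delays is to retry with exponential backoff rather than sleeping a fixed interval. The sketch below is generic: `RateLimited` is a hypothetical stand-in for whatever exception your client raises on HTTP 429, and `call_with_backoff` is an illustrative helper, not part of the SDK.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Delay doubles on each retry (1s, 2s, 4s, ...) plus a little jitter
            # so parallel jobs do not retry in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay / 2))
```

To use it, wrap a single page request in a function or lambda that raises `RateLimited` when the API answers with 429, and pass that callable to `call_with_backoff`.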

Timeout errors (408)

Your chunk size is still too large. Step down: "1d" → "6h" → "1h". For long historical ranges, see Working with historical data.

Memory errors

Reduce max_articles (Python SDK), or write results to disk per window rather than accumulating everything in memory.
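Writing per window can be as simple as appending to a JSON Lines file after each window finishes. This is a minimal sketch assuming each article is a JSON-serializable dict; SDK response objects may need converting to dicts first, and `append_jsonl` is an illustrative name.

```python
import json
from pathlib import Path


def append_jsonl(path: Path, articles: list[dict]) -> int:
    """Append one window's articles to a JSON Lines file and return the count written."""
    with path.open("a", encoding="utf-8") as f:
        for article in articles:
            # One JSON document per line keeps the file appendable and streamable.
            f.write(json.dumps(article, ensure_ascii=False) + "\n")
    return len(articles)
```

Calling this once per window and then discarding that window's list keeps memory use proportional to a single window rather than to the whole dataset.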

Result counts vary between runs

News sources publish continuously. Counts for recent ranges differ between runs as new articles are indexed. Use a fixed to date for reproducible datasets.

Best practices

Measure before you iterate. One /aggregation_count call tells you the exact volume and distribution — it takes seconds and prevents wasted API calls on a wrong chunk size.
Set a fixed to date for reproducible jobs. Open-ended to="now" means results change between runs.
Use show_progress=True during development (Python SDK). It surfaces slow chunks and stalls early.
Lower max_articles if you don’t need everything (Python SDK). The default is 100,000 — set it to your actual target to avoid unnecessary calls.
Store results incrementally for large jobs. Write to disk per window rather than accumulating everything in memory.
