# Working with historical data

> Strategies for efficiently querying large volumes of historical news data while avoiding common performance pitfalls.

## How data is indexed

News API stores data in monthly indexes, optimized for search within a single month. Queries that span multiple months access multiple indexes, and performance degrades proportionally with the time range; queries across 5+ years can cause significant slowdowns.

## Technical limitations

While you technically can query data across our entire historical range (2019 to present), doing so in a single request is not recommended for several reasons:

* **Performance degradation**: queries spanning multiple years search across numerous indexes, significantly increasing response time.
* **Request timeouts**: complex queries combined with long time ranges may time out before completion (default: 30 seconds).
* **Multi-index complexity**: long time ranges require coordinating searches across multiple monthly indexes.
* **Limited result access**: the API limits responses to 10,000 articles per request, so long time range queries may miss the most relevant historical data.

## NLP data availability

Historical data is available from 2019 onward. NLP enrichment is available only for articles indexed from July 2023 onward. For earlier articles, the `nlp` field is present in responses but returned as an empty object `{}`.

To request NLP enrichment for pre-July 2023 data, contact [support@newscatcherapi.com](mailto:support@newscatcherapi.com).

## Efficient query patterns

To retrieve historical data efficiently, break your queries into time chunks rather than querying the full date range at once.
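One way to produce those time chunks is to split the overall date range on calendar-month boundaries, so that each request touches a single monthly index. The following is a minimal sketch using only the standard library; `month_chunks` is a hypothetical helper name, not part of the Newscatcher SDK, and each yielded pair would be passed as the `from_`/`to` values of one request.

```python
import datetime
from typing import Iterator, Tuple


def month_chunks(
    start: datetime.date, end: datetime.date
) -> Iterator[Tuple[datetime.date, datetime.date]]:
    """Yield inclusive (chunk_start, chunk_end) pairs, one per calendar month."""
    current = start
    while current <= end:
        # Compute the first day of the following month
        if current.month == 12:
            next_month = datetime.date(current.year + 1, 1, 1)
        else:
            next_month = datetime.date(current.year, current.month + 1, 1)
        # The chunk ends on the last day of this month or on `end`, whichever is earlier
        chunk_end = min(next_month - datetime.timedelta(days=1), end)
        yield current, chunk_end
        current = next_month


if __name__ == "__main__":
    for chunk_start, chunk_end in month_chunks(
        datetime.date(2019, 11, 15), datetime.date(2020, 2, 10)
    ):
        print(chunk_start, chunk_end)
```

Partial months at either end of the range are kept as shorter chunks, so no request crosses a month boundary.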
### Incorrect approach

```
q=financial crisis&from_=2019-01-01&to_=2025-01-01
```

This query attempts to search approximately 72 monthly indexes at once, which may lead to poor performance or timeout errors (408 Request Timeout).

### Recommended approach

Before retrieving actual articles, use the `/aggregation_count` endpoint to understand the volume of data matching your query across time periods.

**Example request:**

```json JSON theme={null}
{
  "q": "your search query",
  "aggregation_by": "month",
  "from_": "2020-01-01",
  "to_": "2020-12-31",
  "lang": "en"
}
```

```python Python theme={null}
import datetime
from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key="YOUR_API_KEY")

# Get aggregation for planning
aggregation_response = client.aggregation.post(
    q="your search query",
    from_=datetime.datetime(2020, 1, 1),
    to=datetime.datetime(2020, 12, 31, 23, 59, 59),
    aggregation_by="month",
    lang=["en"]
)
```

```typescript TypeScript theme={null}
import { NewscatcherApiClient } from "newscatcher-sdk";

const client = new NewscatcherApiClient({ apiKey: "YOUR_API_KEY" });

// Get aggregation for planning
const aggregationResponse = await client.aggregation.post({
  q: "your search query",
  from: new Date("2020-01-01T00:00:00.000Z"),
  to: new Date("2020-12-31T23:59:59.999Z"),
  aggregationBy: "month",
  lang: ["en"]
});
```

```java Java theme={null}
import com.newscatcher.api.NewscatcherApiClient;
import com.newscatcher.api.resources.aggregation.requests.PostAggregationRequest;
import java.time.Instant;

NewscatcherApiClient client = NewscatcherApiClient.builder()
    .apiKey("YOUR_API_KEY")
    .build();

var response = client.aggregation().post(
    PostAggregationRequest.builder()
        .q("your search query")
        .from("2020-01-01T00:00:00.000Z")
        .to("2020-12-31T23:59:59.999Z")
        .aggregationBy("month")
        .lang("en")
        .build()
);
```

**Example response:**

```json theme={null}
{
  "aggregations": [
    {
      "aggregation_count": [
        {
          "time_frame": "2020-01-01 00:00:00",
          "article_count": 2450
        },
        {
          "time_frame": "2020-02-01 00:00:00",
          "article_count": 3120
        }
        // Additional months...
      ]
    }
  ]
}
```

Retrieve articles in monthly or weekly chunks. Complex queries spanning more than 30 days risk 408 timeout errors; if a chunk times out, subdivide it further.

```json JSON theme={null}
{
  "q": "your search query",
  "from_": "2020-01-01",
  "to_": "2020-01-31",
  "page_size": 100,
  "page": 1
  // Additional parameters as needed
}
```

```python Python theme={null}
import datetime
from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key="YOUR_API_KEY")

# Process one time chunk at a time
response = client.search.post(
    q="your search query",
    from_=datetime.datetime(2020, 1, 1),
    to=datetime.datetime(2020, 1, 31, 23, 59, 59),
    page_size=100,
    page=1
)
```

```typescript TypeScript theme={null}
import { NewscatcherApiClient } from "newscatcher-sdk";

const client = new NewscatcherApiClient({ apiKey: "YOUR_API_KEY" });

// Process one time chunk at a time
const response = await client.search.post({
  q: "your search query",
  from: new Date("2020-01-01T00:00:00.000Z"),
  to: new Date("2020-01-31T23:59:59.999Z"),
  pageSize: 100,
  page: 1
});
```

```java Java theme={null}
import com.newscatcher.api.NewscatcherApiClient;
import com.newscatcher.api.resources.search.requests.PostSearchRequest;
import java.time.Instant;

NewscatcherApiClient client = NewscatcherApiClient.builder()
    .apiKey("YOUR_API_KEY")
    .build();

var response = client.search().post(
    PostSearchRequest.builder()
        .q("your search query")
        .from("2020-01-01T00:00:00.000Z")
        .to("2020-01-31T23:59:59.999Z")
        .pageSize(100)
        .page(1)
        .build()
);
```

### Example implementation

Here's a practical example showing how to retrieve a week of data using the recommended approach.
The same logic scales to retrieve months or years by adjusting the date ranges and aggregation period (day/month):

```python Python theme={null}
import datetime
import json
import os
import time
from typing import Any, Dict, List, Optional

from dotenv import load_dotenv
from newscatcher import NewscatcherApi
from newscatcher.core.api_error import ApiError

load_dotenv()
API_KEY = os.getenv("NEWSCATCHER_API_KEY")


def retrieve_week_of_data(
    client: NewscatcherApi,
    query: str,
    start_date: datetime.datetime,
    end_date: datetime.datetime,
    output_file: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """
    Retrieve a week of historical data using daily aggregation.

    Args:
        client: Configured NewscatcherApi client instance
        query: Search query string
        start_date: Start date for the week
        end_date: End date for the week
        output_file: Filename for JSON output (without .json extension)

    Returns:
        List of all articles retrieved for the entire week
    """
    results = []

    # Step 1: Get daily data volumes
    try:
        aggregation_response = client.aggregation.post(
            q=query,
            from_=start_date,
            to=end_date,
            aggregation_by="day",
            lang=["en"],
        )

        # Log daily volumes for planning
        if aggregation_response.aggregations:
            print("Daily data volumes:")
            aggregation_data = aggregation_response.aggregations[0]["aggregation_count"]
            for day_data in aggregation_data:
                print(
                    f"  {day_data['time_frame']}: {day_data['article_count']} articles"
                )
            print(f"Total articles expected: {aggregation_response.total_hits}")

    except ApiError as e:
        print(f"Error getting aggregation data: {e.status_code} - {e.body}")
        return results

    # Step 2: Process each day in the week
    current_date = start_date.date()
    end_date_only = end_date.date()

    while current_date <= end_date_only:
        # Set time bounds for the current day
        day_start = datetime.datetime.combine(current_date, datetime.time.min)
        day_end = datetime.datetime.combine(current_date, datetime.time.max)

        print(f"Processing {current_date}")

        current_page = 1
        total_pages = 1
        daily_articles = 0

        # Step 3: Paginate through the day's data
        while current_page <= total_pages:
            try:
                response = client.search.post(
                    q=query,
                    from_=day_start,
                    to=day_end,
                    lang=["en"],
                    page=current_page,
                    page_size=100,
                )

                # Add articles to results (store as original JSON/dict)
                if response.articles:
                    # Convert to dict to ensure JSON serialization works
                    for article in response.articles:
                        results.append(article.__dict__)
                    daily_articles += len(response.articles)

                # Get pagination info from response
                total_pages = response.total_pages or 1
                current_page += 1

                print(f"  Retrieved page {current_page - 1} of {total_pages}")

                # Add delay between requests to respect rate limits
                time.sleep(1)

            except ApiError as e:
                if e.status_code == 408:
                    print(
                        "  Request timeout. The time window might contain too many articles."
                    )
                    # For daily windows, this is less likely, but could divide into hours if needed
                elif e.status_code == 429:
                    print("  Rate limit hit. Waiting longer...")
                    time.sleep(5)
                    continue  # Retry the same page
                else:
                    print(f"  API Error ({e.status_code}): {e.body}")

                # Break the pagination loop on non-recoverable errors
                break
            except Exception as e:
                print(f"  Unexpected error: {e}")
                break

        print(
            f"  Completed {current_date}, retrieved {daily_articles} articles for this day"
        )
        print(f"  Total articles so far: {len(results)}")

        # Move to next day
        current_date += datetime.timedelta(days=1)

    # Save results if output file specified
    if output_file and results:
        save_articles_to_json(results, output_file)

    return results


def save_articles_to_json(articles: List[Dict[str, Any]], filename: str):
    """Save articles array to JSON file."""
    json_filename = f"{filename}.json"

    with open(json_filename, "w", encoding="utf-8") as f:
        json.dump(articles, f, indent=2, ensure_ascii=False, default=str)

    print(f"Saved {len(articles)} articles to {json_filename}")


def main():
    """Test the weekly data retrieval function with transport strike query."""
    client = NewscatcherApi(api_key=API_KEY)

    # Define the test week (adjust dates as needed)
    start_date = datetime.datetime(2025, 5, 15)
    end_date = datetime.datetime(2025, 5, 22)

    # Your complex transport strike query
    query = '(airport OR "freight port" OR train) AND (strike OR "union protest" OR "planned closure" OR "worker dispute") AND NOT (past OR historical OR ended)'

    try:
        print(
            f"Testing weekly data retrieval from {start_date.date()} to {end_date.date()}"
        )
        print(f"Query: {query}")
        print("=" * 80)

        # Generate output filename based on date range
        output_file = f"transport_strikes_{start_date.strftime('%Y%m%d')}_{end_date.strftime('%Y%m%d')}"

        articles = retrieve_week_of_data(
            client, query, start_date, end_date, output_file=output_file
        )

        print("=" * 80)
        print(f"SUCCESS: Retrieved {len(articles)} articles total")

        # Optional: Show some sample data
        if articles:
            print("\nSample articles:")
            for i, article in enumerate(articles[:3]):  # Show first 3 articles
                print(f"{i+1}. {article.get('title', 'No title')}")
                print(f"   Published: {article.get('published_date', 'Unknown date')}")
                print(f"   Source: {article.get('name_source', 'Unknown source')}")
                print(f"   URL: {article.get('link', 'No URL')}")
                print()

    except Exception as error:
        print(f"FAILED: {error}")


if __name__ == "__main__":
    main()
```

```typescript TypeScript theme={null}
import * as dotenv from "dotenv";
import * as fs from "fs/promises";
import {
  NewscatcherApi,
  NewscatcherApiClient,
  NewscatcherApiError,
} from "newscatcher-sdk";

// Load environment variables
dotenv.config();

const API_KEY = process.env.NEWSCATCHER_API_KEY;

/**
 * Retrieve a week of historical data using daily aggregation.
 *
 * @param client - Configured NewscatcherApiClient instance
 * @param query - Search query string
 * @param startDate - Start date for the week
 * @param endDate - End date for the week
 * @param outputFile - Optional filename for JSON output (without .json extension)
 * @returns Promise resolving to array of all articles for the week
 */
async function retrieveWeekOfData(
  client: NewscatcherApiClient,
  query: string,
  startDate: Date,
  endDate: Date,
  outputFile?: string
): Promise<any[]> {
  const results: any[] = [];

  // Step 1: Get daily data volumes
  try {
    const aggregationRequest: NewscatcherApi.AggregationPostRequest = {
      q: query,
      from: startDate,
      to: endDate,
      aggregationBy: "day",
      lang: ["en"],
    };

    const aggregationResponse = await client.aggregation.post(
      aggregationRequest
    );

    // Log daily volumes for planning
    const responseData = aggregationResponse as any;
    if (
      responseData.aggregations &&
      responseData.aggregations[0]?.aggregationCount
    ) {
      console.log("Daily data volumes:");
      const aggregationData = responseData.aggregations[0].aggregationCount;

      for (const dayData of aggregationData) {
        console.log(`  ${dayData.timeFrame}: ${dayData.articleCount} articles`);
      }
      console.log(`Total articles expected: ${responseData.totalHits}`);
    }
  } catch (error) {
    if (error instanceof NewscatcherApiError) {
      console.error(
        `Error getting aggregation data: ${error.statusCode} - ${error.message}`
      );
    } else {
      console.error(`Error getting aggregation data: ${error}`);
    }
    return results;
  }

  // Step 2: Process each day in the week
  const currentDate = new Date(startDate);
  const endDateTime = new Date(endDate);

  while (currentDate <= endDateTime) {
    // Set time bounds for the current day
    const dayStart = new Date(currentDate);
    dayStart.setHours(0, 0, 0, 0);

    const dayEnd = new Date(currentDate);
    dayEnd.setHours(23, 59, 59, 999);

    console.log(`Processing ${currentDate.toISOString().split("T")[0]}`);

    let currentPage = 1;
    let totalPages = 1;
    let dailyArticles = 0;

    // Step 3: Paginate through the day's data
    while (currentPage <= totalPages) {
      try {
        const searchRequest: NewscatcherApi.SearchPostRequest = {
          q: query,
          from: dayStart,
          to: dayEnd,
          lang: ["en"],
          page: currentPage,
          pageSize: 100,
        };

        const response = await client.search.post(searchRequest);
        const responseData = response as any;

        // Add articles to results
        if (responseData.articles && Array.isArray(responseData.articles)) {
          results.push(...responseData.articles);
          dailyArticles += responseData.articles.length;
        }

        // Get pagination info from response
        totalPages = responseData.totalPages || 1;
        currentPage++;

        console.log(`  Retrieved page ${currentPage - 1} of ${totalPages}`);

        // Add delay between requests to respect rate limits
        await new Promise((resolve) => setTimeout(resolve, 1000));
      } catch (error) {
        if (error instanceof NewscatcherApiError) {
          if (error.statusCode === 408) {
            console.log(
              "  Request timeout. The time window might contain too many articles."
            );
          } else if (error.statusCode === 429) {
            console.log("  Rate limit hit. Waiting longer...");
            await new Promise((resolve) => setTimeout(resolve, 5000));
            continue; // Retry the same page
          } else {
            console.log(`  API Error (${error.statusCode}): ${error.message}`);
          }
        } else {
          console.log(`  Unexpected error: ${error}`);
        }

        // Break the pagination loop on non-recoverable errors
        break;
      }
    }

    console.log(
      `  Completed ${
        currentDate.toISOString().split("T")[0]
      }, retrieved ${dailyArticles} articles for this day`
    );
    console.log(`  Total articles so far: ${results.length}`);

    // Move to next day
    currentDate.setDate(currentDate.getDate() + 1);
  }

  // Save results if output file specified
  if (outputFile && results.length > 0) {
    await saveArticlesToJson(results, outputFile);
  }

  return results;
}

/**
 * Save articles array to JSON file.
 */
async function saveArticlesToJson(
  articles: any[],
  filename: string
): Promise<void> {
  const jsonFilename = `${filename}.json`;

  try {
    await fs.writeFile(
      jsonFilename,
      JSON.stringify(articles, null, 2),
      "utf-8"
    );
    console.log(`Saved ${articles.length} articles to ${jsonFilename}`);
  } catch (error) {
    console.error(`Error saving file: ${error}`);
  }
}

/**
 * Example usage of the weekly data retrieval function.
 */
async function main(): Promise<void> {
  if (!API_KEY) {
    console.error("NEWSCATCHER_API_KEY environment variable is required");
    console.error("Please check your .env file");
    process.exit(1);
  }

  const client = new NewscatcherApiClient({
    apiKey: API_KEY,
  });

  // Define the date range (adjust as needed)
  const startDate = new Date("2025-05-15T00:00:00.000Z");
  const endDate = new Date("2025-05-22T23:59:59.999Z");

  // Example query for transport strikes
  const query =
    '(airport OR "freight port" OR train) AND (strike OR "union protest" OR "planned closure" OR "worker dispute") AND NOT (past OR historical OR ended)';

  try {
    console.log(
      `Retrieving data from ${startDate.toISOString().split("T")[0]} to ${
        endDate.toISOString().split("T")[0]
      }`
    );
    console.log(`Query: ${query}`);
    console.log("=".repeat(80));

    // Generate output filename based on date range
    const formatDate = (date: Date) =>
      date.toISOString().split("T")[0].replace(/-/g, "");
    const outputFile = `articles_${formatDate(startDate)}_${formatDate(
      endDate
    )}`;

    const articles = await retrieveWeekOfData(
      client,
      query,
      startDate,
      endDate,
      outputFile
    );

    console.log("=".repeat(80));
    console.log(`SUCCESS: Retrieved ${articles.length} articles total`);

    // Show some sample data
    if (articles.length > 0) {
      console.log("\nSample articles:");
      for (let i = 0; i < Math.min(3, articles.length); i++) {
        const article = articles[i];
        console.log(`${i + 1}. ${article.title || "No title"}`);
        console.log(
          `   Published: ${
            article.publishedDate || article.published_date || "Unknown date"
          }`
        );
        console.log(
          `   Source: ${
            article.nameSource || article.name_source || "Unknown source"
          }`
        );
        console.log(`   URL: ${article.link || "No URL"}`);
        console.log();
      }
    }
  } catch (error) {
    console.error(`FAILED: ${error}`);
  }
}

// Run the main function
if (require.main === module) {
  main();
}
```

For detailed guidance on retrieving large datasets, see [Retrieve large datasets](/news-api/how-to/retrieve-more-than-10k-articles).
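The example implementation waits a fixed five seconds after a `429` response and retries without a limit. For long retrieval jobs, exponential backoff with a retry cap may be a safer pattern. The following is a minimal sketch under stated assumptions: `call_with_backoff` and `is_retryable` are hypothetical names, not part of the Newscatcher SDK, and the delay values are illustrative rather than recommended settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    request: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying retryable failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception as error:
            # Give up on non-retryable errors or once retries are exhausted
            if not is_retryable(error) or attempt == max_retries:
                raise
            # Delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Add up to 10% random jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # The loop always returns or raises
```

With the SDK used above, the retry predicate could be something like `lambda e: isinstance(e, ApiError) and e.status_code == 429`, with the `client.search.post(...)` call wrapped in a zero-argument function or lambda.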
## Common pitfalls to avoid

| Pitfall                              | Impact                                        | Solution                                                                             |
| ------------------------------------ | --------------------------------------------- | ------------------------------------------------------------------------------------ |
| Querying multiple years at once      | Slow performance, timeouts (408 errors)       | Break queries into monthly chunks                                                    |
| Using overly broad search terms      | Excessive result volume                       | Refine query terms to be more specific                                               |
| Insufficient error handling          | Failed data retrieval                         | Implement robust retry and error handling                                            |
| Underestimating data volume          | Resource constraints                          | Use aggregation endpoint to estimate volume first                                    |
| Requesting too many results per page | Slow response times                           | Use reasonable page sizes (100-1000)                                                 |
| Improper pagination implementation   | Incomplete data retrieval                     | See [Retrieve large datasets](/news-api/how-to/retrieve-more-than-10k-articles)      |
| Expecting NLP data before July 2023  | `nlp` field is present but returned as `{}`   | Set `has_nlp=true` to filter for NLP-enriched articles only                          |
| Not prioritizing recent data         | Slower iteration when validating a new query  | Start with a recent short range to validate results before querying the full history |
| Missing delays between requests      | `429` errors interrupting long retrieval jobs | Add a delay between requests and implement exponential backoff on `429` responses    |

## See also

* [API Reference](/news-api/api-reference/search/search-articles-post)
* [Retrieve large datasets](/news-api/how-to/retrieve-more-than-10k-articles)
* [Handle errors](/news-api/troubleshooting/error-handling)