> ## Documentation Index
> Fetch the complete documentation index at: https://newscatcherinc-docs.mintlify.site/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Working with historical data

> Strategies for efficiently querying large volumes of historical news data while avoiding common performance pitfalls.

## How data is indexed

News API stores data in monthly indexes, optimized for search within a single month.
Queries that span multiple months access multiple indexes, and performance degrades
proportionally with the time range — queries across 5+ years can cause significant
slowdowns.

## Technical limitations

While you technically can query data across our entire historical range (2019 to
present), doing so in a single request is not recommended for several reasons:

* **Performance degradation** — queries spanning multiple years search across numerous indexes, significantly increasing response time.
* **Request timeouts** — complex queries combined with long time ranges may time out before completion (default: 30 seconds).
* **Multi-index complexity** — long time ranges require coordinating searches across multiple monthly indexes.
* **Limited result access** — the API limits responses to 10,000 articles per request, so long time range queries may miss the most relevant historical data.

## NLP data availability

Historical data is available from 2019 onward. NLP enrichment is available only
for articles indexed from July 2023 onward. For earlier articles, the `nlp` field
is present in responses but returned as an empty object `{}`.

To request NLP enrichment for pre-July 2023 data, contact [support@newscatcherapi.com](mailto:support@newscatcherapi.com).

## Efficient query patterns

To retrieve historical data efficiently, break your queries into time chunks
rather than querying the full date range at once.

### Incorrect approach

```
q=financial crisis&from_=2019-01-01&to_=2025-01-01
```

This query attempts to search approximately 72 monthly indexes at once, which
may lead to poor performance or timeout errors (408 Request Timeout).

### Recommended approach

<Steps>
  <Step title="Estimate data volume using aggregation">
    Before retrieving actual articles, use the `/aggregation_count` endpoint to understand the volume of data matching your query across time periods.

    **Example request:**

    <CodeGroup>
      ```json JSON theme={null}
      {
          "q": "your search query",
          "aggregation_by": "month",
          "from_": "2020-01-01",
          "to_": "2020-12-31",
          "lang": "en"
      }
      ```

      ```python Python theme={null}
      import datetime
      from newscatcher import NewscatcherApi

      client = NewscatcherApi(api_key="YOUR_API_KEY")

      # Get aggregation for planning
      aggregation_response = client.aggregation.post(
          q="your search query",
          from_=datetime.datetime(2020, 1, 1),
          to=datetime.datetime(2020, 12, 31, 23, 59, 59),
          aggregation_by="month",
          lang=["en"]
      )
      ```

      ```typescript TypeScript theme={null}
      import { NewscatcherApiClient } from "newscatcher-sdk";

      const client = new NewscatcherApiClient({
          apiKey: "YOUR_API_KEY"
      });

      // Get aggregation for planning
      const aggregationResponse = await client.aggregation.post({
          q: "your search query",
          from: new Date("2020-01-01T00:00:00.000Z"),
          to: new Date("2020-12-31T23:59:59.999Z"),
          aggregationBy: "month",
          lang: ["en"]
      });
      ```

      ```java Java theme={null}
      import com.newscatcher.api.NewscatcherApiClient;
      import com.newscatcher.api.resources.aggregation.requests.PostAggregationRequest;
      import java.time.Instant;

      NewscatcherApiClient client = NewscatcherApiClient.builder()
          .apiKey("YOUR_API_KEY")
          .build();

      var response = client.aggregation().post(
          PostAggregationRequest.builder()
              .q("your search query")
              .from("2020-01-01T00:00:00.000Z")
              .to("2020-12-31T23:59:59.999Z")
              .aggregationBy("month")
              .lang("en")
              .build()
      );
      ```
    </CodeGroup>

    **Example response:**

    ```json theme={null}
    {
        "aggregations": [
            {
                "aggregation_count": [
                    {
                        "time_frame": "2020-01-01 00:00:00",
                        "article_count": 2450
                    },
                    {
                        "time_frame": "2020-02-01 00:00:00",
                        "article_count": 3120
                    }
                    // Additional months...
                ]
            }
        ]
    }
    ```
  </Step>

  <Step title="Process data in time chunks">
    Retrieve articles in monthly or weekly chunks. Complex queries spanning more
    than 30 days risk 408 timeout errors — if a chunk times out, subdivide it further

    <CodeGroup>
      ```json JSON theme={null}
      {
          "q": "your search query",
          "from_": "2020-01-01",
          "to_": "2020-01-31",
          "page_size": 100,
          "page": 1
          // Additional parameters as needed
      }
      ```

      ```python Python theme={null}
      import datetime
      from newscatcher import NewscatcherApi

      client = NewscatcherApi(api_key="YOUR_API_KEY")

      # Process one time chunk at a time
      response = client.search.post(
          q="your search query",
          from_=datetime.datetime(2020, 1, 1),
          to=datetime.datetime(2020, 1, 31, 23, 59, 59),
          page_size=100,
          page=1
      )
      ```

      ```typescript TypeScript theme={null}
      import { NewscatcherApiClient } from "newscatcher-sdk";

      const client = new NewscatcherApiClient({
          apiKey: "YOUR_API_KEY"
      });

      // Process one time chunk at a time
      const response = await client.search.post({
          q: "your search query",
          from: new Date("2020-01-01T00:00:00.000Z"),
          to: new Date("2020-01-31T23:59:59.999Z"),
          pageSize: 100,
          page: 1
      });
      ```

      ```java Java theme={null}
      import com.newscatcher.api.NewscatcherApiClient;
      import com.newscatcher.api.resources.search.requests.PostSearchRequest;
      import java.time.Instant;

      NewscatcherApiClient client = NewscatcherApiClient.builder()
          .apiKey("YOUR_API_KEY")
          .build();

      var response = client.search().post(
          PostSearchRequest.builder()
              .q("your search query")
              .from("2020-01-01T00:00:00.000Z")
              .to("2020-01-31T23:59:59.999Z")
              .pageSize(100)
              .page(1)
              .build()
      );
      ```
    </CodeGroup>
  </Step>
</Steps>

### Example implementation

Here's a practical example showing how to retrieve a week of data using the
recommended approach. The same logic scales to retrieve months or years by
adjusting the date ranges and aggregation period (day/month):

<Expandable title="complete implementation code">
  <CodeGroup>
    ```python Python theme={null}
    import datetime
    import time
    from typing import List, Dict, Any

    from newscatcher import NewscatcherApi
    from newscatcher.core.api_error import ApiError

    import os
    from dotenv import load_dotenv

    load_dotenv()
    API_KEY = os.getenv("NEWSCATCHER_API_KEY")


    import datetime
    import time
    import json
    from typing import List, Dict, Any, Optional

    from newscatcher import NewscatcherApi
    from newscatcher.core.api_error import ApiError


    def retrieve_week_of_data(
        client: NewscatcherApi,
        query: str,
        start_date: datetime.datetime,
        end_date: datetime.datetime,
        output_file: Optional[str] = None,
    ) -> List[Dict[str, Any]]:
        """
        Retrieve a week of historical data using daily aggregation.

        Args:
            client: Configured NewscatcherApi client instance
            query: Search query string
            start_date: Start date for the week
            end_date: End date for the week
            output_file: Filename for JSON output (without .json extension)

        Returns:
            List of all articles retrieved for the entire week
        """
        results = []

        # Step 1: Get daily data volumes
        try:
            aggregation_response = client.aggregation.post(
                q=query,
                from_=start_date,
                to=end_date,
                aggregation_by="day",
                lang=["en"],
            )

            # Log daily volumes for planning
            if aggregation_response.aggregations:
                print("Daily data volumes:")
                aggregation_data = aggregation_response.aggregations[0]["aggregation_count"]
                for day_data in aggregation_data:
                    print(
                        f"  {day_data['time_frame']}: {day_data['article_count']} articles"
                    )
                print(f"Total articles expected: {aggregation_response.total_hits}")

        except ApiError as e:
            print(f"Error getting aggregation data: {e.status_code} - {e.body}")
            return results

        # Step 2: Process each day in the week
        current_date = start_date.date()
        end_date_only = end_date.date()

        while current_date <= end_date_only:
            # Set time bounds for the current day
            day_start = datetime.datetime.combine(current_date, datetime.time.min)
            day_end = datetime.datetime.combine(current_date, datetime.time.max)

            print(f"Processing {current_date}")

            current_page = 1
            total_pages = 1
            daily_articles = 0

            # Step 3: Paginate through the day's data
            while current_page <= total_pages:
                try:
                    response = client.search.post(
                        q=query,
                        from_=day_start,
                        to=day_end,
                        lang=["en"],
                        page=current_page,
                        page_size=100,
                    )

                    # Add articles to results (store as original JSON/dict)
                    if response.articles:
                        # Convert to dict to ensure JSON serialization works
                        for article in response.articles:
                            results.append(article.__dict__)
                        daily_articles += len(response.articles)

                    # Get pagination info from response
                    total_pages = response.total_pages or 1
                    current_page += 1

                    print(f"  Retrieved page {current_page - 1} of {total_pages}")

                    # Add delay between requests to respect rate limits
                    time.sleep(1)

                except ApiError as e:
                    if e.status_code == 408:
                        print(
                            "  Request timeout. The time window might contain too many articles."
                        )
                        # For daily windows, this is less likely, but could divide into hours if needed
                    elif e.status_code == 429:
                        print("  Rate limit hit. Waiting longer...")
                        time.sleep(5)
                        continue  # Retry the same page
                    else:
                        print(f"  API Error ({e.status_code}): {e.body}")
                    # Break the pagination loop on non-recoverable errors
                    break
                except Exception as e:
                    print(f"  Unexpected error: {e}")
                    break

            print(
                f"  Completed {current_date}, retrieved {daily_articles} articles for this day"
            )
            print(f"  Total articles so far: {len(results)}")

            # Move to next day
            current_date += datetime.timedelta(days=1)

        # Save results if output file specified
        if output_file and results:
            save_articles_to_json(results, output_file)

        return results


    def save_articles_to_json(articles: List[Dict[str, Any]], filename: str):
        """Save articles array to JSON file."""
        json_filename = f"{filename}.json"
        with open(json_filename, "w", encoding="utf-8") as f:
            json.dump(articles, f, indent=2, ensure_ascii=False, default=str)
        print(f"Saved {len(articles)} articles to {json_filename}")


    def main():
        """Test the weekly data retrieval function with transport strike query."""
        client = NewscatcherApi(api_key=API_KEY)

        # Define the test week (adjust dates as needed)
        start_date = datetime.datetime(2025, 5, 15)
        end_date = datetime.datetime(2025, 5, 22)

        # Your complex transport strike query
        query = '(airport OR "freight port" OR train) AND (strike OR "union protest" OR "planned closure" OR "worker dispute") AND NOT (past OR historical OR ended)'

        try:
            print(
                f"Testing weekly data retrieval from {start_date.date()} to {end_date.date()}"
            )
            print(f"Query: {query}")
            print("=" * 80)

            # Generate output filename based on date range
            output_file = f"transport_strikes_{start_date.strftime('%Y%m%d')}_{end_date.strftime('%Y%m%d')}"

            articles = retrieve_week_of_data(
                client, query, start_date, end_date, output_file=output_file
            )

            print("=" * 80)
            print(f"SUCCESS: Retrieved {len(articles)} articles total")

            # Optional: Show some sample data
            if articles:
                print("\nSample articles:")
                for i, article in enumerate(articles[:3]):  # Show first 3 articles
                    print(f"{i+1}. {article.get('title', 'No title')}")
                    print(f"   Published: {article.get('published_date', 'Unknown date')}")
                    print(f"   Source: {article.get('name_source', 'Unknown source')}")
                    print(f"   URL: {article.get('link', 'No URL')}")
                    print()

        except Exception as error:
            print(f"FAILED: {error}")


    if __name__ == "__main__":
        main()
    ```

    ```typescript TypeScript theme={null}
    import * as dotenv from "dotenv";
    import * as fs from "fs/promises";
    import {
      NewscatcherApi,
      NewscatcherApiClient,
      NewscatcherApiError,
    } from "newscatcher-sdk";

    // Load environment variables
    dotenv.config();
    const API_KEY = process.env.NEWSCATCHER_API_KEY;

    /**
     * Retrieve a week of historical data using daily aggregation.
     *
     * @param client - Configured NewscatcherApiClient instance
     * @param query - Search query string
     * @param startDate - Start date for the week
     * @param endDate - End date for the week
     * @param outputFile - Optional filename for JSON output (without .json extension)
     * @returns Promise resolving to array of all articles for the week
     */
    async function retrieveWeekOfData(
      client: NewscatcherApiClient,
      query: string,
      startDate: Date,
      endDate: Date,
      outputFile?: string
    ): Promise<any[]> {
      const results: any[] = [];

      // Step 1: Get daily data volumes
      try {
        const aggregationRequest: NewscatcherApi.AggregationPostRequest = {
          q: query,
          from: startDate,
          to: endDate,
          aggregationBy: "day",
          lang: ["en"],
        };

        const aggregationResponse = await client.aggregation.post(
          aggregationRequest
        );

        // Log daily volumes for planning
        const responseData = aggregationResponse as any;
        if (
          responseData.aggregations &&
          responseData.aggregations[0]?.aggregationCount
        ) {
          console.log("Daily data volumes:");
          const aggregationData = responseData.aggregations[0].aggregationCount;
          for (const dayData of aggregationData) {
            console.log(`  ${dayData.timeFrame}: ${dayData.articleCount} articles`);
          }
          console.log(`Total articles expected: ${responseData.totalHits}`);
        }
      } catch (error) {
        if (error instanceof NewscatcherApiError) {
          console.error(
            `Error getting aggregation data: ${error.statusCode} - ${error.message}`
          );
        } else {
          console.error(`Error getting aggregation data: ${error}`);
        }
        return results;
      }

      // Step 2: Process each day in the week
      const currentDate = new Date(startDate);
      const endDateTime = new Date(endDate);

      while (currentDate <= endDateTime) {
        // Set time bounds for the current day
        const dayStart = new Date(currentDate);
        dayStart.setHours(0, 0, 0, 0);

        const dayEnd = new Date(currentDate);
        dayEnd.setHours(23, 59, 59, 999);

        console.log(`Processing ${currentDate.toISOString().split("T")[0]}`);

        let currentPage = 1;
        let totalPages = 1;
        let dailyArticles = 0;

        // Step 3: Paginate through the day's data
        while (currentPage <= totalPages) {
          try {
            const searchRequest: NewscatcherApi.SearchPostRequest = {
              q: query,
              from: dayStart,
              to: dayEnd,
              lang: ["en"],
              page: currentPage,
              pageSize: 100,
            };

            const response = await client.search.post(searchRequest);
            const responseData = response as any;

            // Add articles to results
            if (responseData.articles && Array.isArray(responseData.articles)) {
              results.push(...responseData.articles);
              dailyArticles += responseData.articles.length;
            }

            // Get pagination info from response
            totalPages = responseData.totalPages || 1;
            currentPage++;

            console.log(`  Retrieved page ${currentPage - 1} of ${totalPages}`);

            // Add delay between requests to respect rate limits
            await new Promise((resolve) => setTimeout(resolve, 1000));
          } catch (error) {
            if (error instanceof NewscatcherApiError) {
              if (error.statusCode === 408) {
                console.log(
                  "  Request timeout. The time window might contain too many articles."
                );
              } else if (error.statusCode === 429) {
                console.log("  Rate limit hit. Waiting longer...");
                await new Promise((resolve) => setTimeout(resolve, 5000));
                continue; // Retry the same page
              } else {
                console.log(`  API Error (${error.statusCode}): ${error.message}`);
              }
            } else {
              console.log(`  Unexpected error: ${error}`);
            }
            // Break the pagination loop on non-recoverable errors
            break;
          }
        }

        console.log(
          `  Completed ${
            currentDate.toISOString().split("T")[0]
          }, retrieved ${dailyArticles} articles for this day`
        );
        console.log(`  Total articles so far: ${results.length}`);

        // Move to next day
        currentDate.setDate(currentDate.getDate() + 1);
      }

      // Save results if output file specified
      if (outputFile && results.length > 0) {
        await saveArticlesToJson(results, outputFile);
      }

      return results;
    }

    /**
     * Save articles array to JSON file.
     */
    async function saveArticlesToJson(
      articles: any[],
      filename: string
    ): Promise<void> {
      const jsonFilename = `${filename}.json`;
      try {
        await fs.writeFile(
          jsonFilename,
          JSON.stringify(articles, null, 2),
          "utf-8"
        );
        console.log(`Saved ${articles.length} articles to ${jsonFilename}`);
      } catch (error) {
        console.error(`Error saving file: ${error}`);
      }
    }

    /**
     * Example usage of the weekly data retrieval function.
     */
    async function main(): Promise<void> {
      if (!API_KEY) {
        console.error("NEWSCATCHER_API_KEY environment variable is required");
        console.error("Please check your .env file");
        process.exit(1);
      }

      const client = new NewscatcherApiClient({
        apiKey: API_KEY,
      });

      // Define the date range (adjust as needed)
      const startDate = new Date("2025-05-15T00:00:00.000Z");
      const endDate = new Date("2025-05-22T23:59:59.999Z");

      // Example query for transport strikes
      const query =
        '(airport OR "freight port" OR train) AND (strike OR "union protest" OR "planned closure" OR "worker dispute") AND NOT (past OR historical OR ended)';

      try {
        console.log(
          `Retrieving data from ${startDate.toISOString().split("T")[0]} to ${
            endDate.toISOString().split("T")[0]
          }`
        );
        console.log(`Query: ${query}`);
        console.log("=".repeat(80));

        // Generate output filename based on date range
        const formatDate = (date: Date) =>
          date.toISOString().split("T")[0].replace(/-/g, "");
        const outputFile = `articles_${formatDate(startDate)}_${formatDate(
          endDate
        )}`;

        const articles = await retrieveWeekOfData(
          client,
          query,
          startDate,
          endDate,
          outputFile
        );

        console.log("=".repeat(80));
        console.log(`SUCCESS: Retrieved ${articles.length} articles total`);

        // Show some sample data
        if (articles.length > 0) {
          console.log("\nSample articles:");
          for (let i = 0; i < Math.min(3, articles.length); i++) {
            const article = articles[i];
            console.log(`${i + 1}. ${article.title || "No title"}`);
            console.log(
              `   Published: ${
                article.publishedDate || article.published_date || "Unknown date"
              }`
            );
            console.log(
              `   Source: ${
                article.nameSource || article.name_source || "Unknown source"
              }`
            );
            console.log(`   URL: ${article.link || "No URL"}`);
            console.log();
          }
        }
      } catch (error) {
        console.error(`FAILED: ${error}`);
      }
    }

    // Run the main function
    if (require.main === module) {
      main();
    }
    ```
  </CodeGroup>
</Expandable>

<Note>
  For detailed guidance on retrieving large datasets, see
  [Retrieve large datasets](/news-api/how-to/retrieve-more-than-10k-articles).
</Note>

## Common pitfalls to avoid

| Pitfall                              | Impact                                        | Solution                                                                             |
| ------------------------------------ | --------------------------------------------- | ------------------------------------------------------------------------------------ |
| Querying multiple years at once      | Slow performance, timeouts (408 errors)       | Break queries into monthly chunks                                                    |
| Using overly broad search terms      | Excessive result volume                       | Refine query terms to be more specific                                               |
| Insufficient error handling          | Failed data retrieval                         | Implement robust retry and error handling                                            |
| Underestimating data volume          | Resource constraints                          | Use aggregation endpoint to estimate volume first                                    |
| Requesting too many results per page | Slow response times                           | Use reasonable page sizes (100-1000)                                                 |
| Improper pagination implementation   | Incomplete data retrieval                     | See [Retrieve large datasets](/news-api/how-to/retrieve-more-than-10k-articles)      |
| Expecting NLP data before July 2023  | `nlp` field is present but returned as `{}`   | Set `has_nlp=true` to filter for NLP-enriched articles only                          |
| Not prioritizing recent data         | Slower iteration when validating a new query  | Start with a recent short range to validate results before querying the full history |
| Missing delays between requests      | `429` errors interrupting long retrieval jobs | Add a delay between requests and implement exponential backoff on `429` responses    |

## See also

* [API Reference](/news-api/api-reference/search/search-articles-post)
* [Retrieve large datasets](/news-api/how-to/retrieve-more-than-10k-articles)
* [Handle errors](/news-api/troubleshooting/error-handling)
