# Clustering news articles

> Group related articles together to identify trends and reduce noise in large volumes of news data.

## Overview

Clustering groups articles by semantic similarity, not just keyword overlap. When you enable clustering on a [Search](/news-api/api-reference/search/search-articles-get) or [Latest Headlines](/news-api/api-reference/latest-headlines/retrieve-latest-headlines-get) request, the API returns articles organized into clusters, each containing articles that cover the same story or topic.

Use clustering to:

* Identify how different sources cover the same story.
* Spot emerging trends across large volumes of articles.
* Track how a story develops over time as cluster composition changes.
* Reduce manual organization work when processing high-volume news data.

Clustering is available for [all languages supported by News API](/news-api/api-reference/enumerated-parameters#supported-language-codes).

## How clustering works

Clustering involves two distinct stages that happen at different points in time: embedding generation, which happens as part of the data processing pipeline before any API request, and cluster formation, which happens at query time when you make a request with clustering enabled.

### Embedding generation (offline)

As part of the data processing pipeline, each article is converted into a dense vector, called an *embedding*, that represents its semantic meaning. These embeddings are computed and stored when the article is indexed, not when you make a request. Articles about the same topic produce embeddings that are close together in vector space, even when they use different words.

Diagram showing the offline data pipeline


The embedding model and the fields used to generate embeddings depend on the article's publication date:

| Date range             | Embedding model       | Fields used                                    |
| ---------------------- | --------------------- | ---------------------------------------------- |
| Before 2026-01-01      | multilingual-e5-large | Configurable: `content`, `title`, or `summary` |
| From 2026-01-01 onward | Qwen3-Embedding-0.6B  | `title` + `content` (fixed)                    |
| Spans both periods     | n/a                   | Returns an error                               |

**Clustering behavior changed on January 1, 2026.** For articles published from that date onward, the system uses a new embedding model and clustering algorithm. See the table above and the [parameter reference](#parameters) for details.

### Cluster formation (query time)

When you make a request with `clustering_enabled=true`, the backend service retrieves the pre-computed embeddings for the articles that match your query, then runs the [Leiden graph community detection algorithm](https://en.wikipedia.org/wiki/Leiden_algorithm) to group them into clusters:

1. The cosine similarity between each pair of article embeddings is calculated.
2. Article pairs whose similarity score exceeds the `clustering_threshold` are connected as edges in a similarity graph.
3. The Leiden algorithm detects communities within that graph.
4. Each detected community becomes a cluster with a unique `cluster_id`.

Diagram showing the Leiden graph community detection process


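The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the service's implementation: the 3-dimensional vectors are made up, the helper names `cosine` and `cluster_by_threshold` are hypothetical, and connected components stand in for the Leiden algorithm. Leiden additionally optimizes community quality, so it may split a loosely connected component that this sketch keeps as one cluster.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_by_threshold(embeddings, threshold=0.7):
    """Label vectors by connected component of the similarity graph.

    Connected components are a simplified stand-in for Leiden community
    detection (an assumption made for illustration only).
    """
    n = len(embeddings)

    # Steps 1-2: connect every pair whose similarity exceeds the threshold
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if cosine(embeddings[i], embeddings[j]) > threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Steps 3-4: walk the graph and give each component its own id
    labels = {}
    next_id = 0
    for start in range(n):
        if start in labels:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if node in labels:
                continue
            labels[node] = next_id
            stack.extend(m for m in neighbors[node] if m not in labels)
        next_id += 1
    return [labels[i] for i in range(n)]


# Made-up vectors: two pairs of similar "articles" and one outlier
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.3],
    [0.1, 0.8, 0.4],
    [0.0, 0.0, 1.0],
]

print(cluster_by_threshold(toy_embeddings, threshold=0.7))  # → [0, 0, 1, 1, 2]
# A much lower threshold links everything into a single cluster
print(cluster_by_threshold(toy_embeddings, threshold=0.3))  # → [0, 0, 0, 0, 0]
```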
The Leiden algorithm produces more stable and accurate clusters than the previous density-based method because it optimizes community structure globally rather than locally.

**Clustering does not support a date range that spans both before and after January 1, 2026.** If your `from_` date is before 2026-01-01 00:00:00 and your `to_` date is after, the API returns an error. The two date ranges use incompatible embedding spaces. Send separate requests for each period if you need data from both periods.

## Parameters

To enable clustering and control its behavior, include the following parameters in your request:

| Parameter              | Type    | Default   | Description |
| ---------------------- | ------- | --------- | ----------- |
| `clustering_enabled`   | boolean | `false`   | Set to `true` to enable clustering. |
| `clustering_threshold` | float   | `0.7`     | Minimum cosine similarity required for two articles to be placed in the same cluster. Accepts values from `0` to `1`. Higher values produce smaller, more tightly related clusters. |
| `clustering_variable`  | string  | `content` | Deprecated from January 1, 2026 onward. For pre-2026 data, specifies which field is used for embeddings: `content`, `title`, or `summary`. For post-2026 data, this parameter is ignored: clustering always uses `title` + `content`. |

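Because a clustered request cannot span January 1, 2026, a client may want to split a wider date range before sending it. The sketch below shows one way to do that with the standard library only. `split_at_cutoff` is a hypothetical helper, not part of `newscatcher-sdk`, and it assumes the cutoff is 2026-01-01 00:00:00 UTC; ending the first window one second early is a conservative choice of this sketch, not documented API behavior.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the embedding model switch happens at 2026-01-01 00:00:00 UTC
CUTOFF = datetime(2026, 1, 1, tzinfo=timezone.utc)


def split_at_cutoff(start, end, cutoff=CUTOFF):
    """Return a list of (from_, to_) pairs, none of which crosses `cutoff`."""
    if start > end:
        raise ValueError("start must not be after end")
    if start < cutoff < end:
        # End the first window one second early so it stays strictly pre-cutoff
        return [(start, cutoff - timedelta(seconds=1)), (cutoff, end)]
    return [(start, end)]


windows = split_at_cutoff(
    datetime(2025, 12, 15, tzinfo=timezone.utc),
    datetime(2026, 1, 15, tzinfo=timezone.utc),
)
for from_, to_ in windows:
    # Each pair can be passed as from_ / to_ in its own clustered request
    print(from_.isoformat(), "->", to_.isoformat())
```

Since the two periods use incompatible embedding spaces, the clusters returned for each window are best treated as independent result sets rather than merged by `cluster_id`.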
### Choosing a threshold

The `clustering_threshold` value controls the trade-off between cluster size and topical precision:

| Value | Effect                                          |
| ----- | ----------------------------------------------- |
| `0.6` | Larger clusters; more topically diverse         |
| `0.7` | Balanced cluster size and similarity (default)  |
| `0.8` | Smaller clusters; more tightly related articles |

## Set page size for effective clustering

Clustering operates on one page of results at a time. If related articles are split across multiple pages, they are clustered independently and may end up in separate clusters.

To ensure that all related articles are considered together, set `page_size` to a value greater than or equal to the expected number of results for your query. For example, if your query is likely to return 150 articles, set `page_size` to at least `150`.

## Response structure

When clustering is enabled, the API response includes the following fields at the top level:

* `clusters_count`: The total number of clusters found.
* `clusters`: An array of cluster objects.

Each cluster object contains:

* `cluster_id`: A unique identifier for the cluster.
* `cluster_size`: The number of articles in the cluster.
* `articles`: An array of article objects belonging to this cluster.

## Code example

The following example searches for articles about renewable energy with clustering enabled, then prints a summary of each cluster.
```python clustering.py
import os
import datetime

from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key=os.environ["NEWSCATCHER_API_KEY"])

response = client.search.post(
    q="renewable energy",
    lang=["en"],
    from_=datetime.datetime.fromisoformat("2026-04-01 00:00:00+00:00"),
    clustering_enabled=True,
    clustering_threshold=0.7,
    page_size=200,
)

print(f"Found {response.clusters_count} clusters")
for cluster in (response.clusters or []):
    print(f"  Cluster {cluster.cluster_id}: {cluster.cluster_size} articles")
    print(f"    First article: {cluster.articles[0].title}")
```

The response groups articles into cluster objects:

```json {7-11}
{
  "status": "ok",
  "total_hits": 182,
  "page": 1,
  "total_pages": 1,
  "page_size": 200,
  "clusters_count": 41,
  "clusters": [
    {
      "cluster_id": "7222464423361803386",
      "cluster_size": 11,
      "articles": [
        {
          "title": "Renewable Energy Investment Hits Record High in Q1",
          "author": "Jane Smith",
          "authors": ["Jane Smith"],
          "published_date": "2026-04-15 17:36:01",
          "published_date_precision": "full",
          "link": "https://example.com/renewable-energy-record",
          "domain_url": "example.com",
          "name_source": "Example News",
          "country": "US",
          "language": "en",
          "description": "Global investment in renewable energy reached...",
          "content": "Full article text...",
          "word_count": 542,
          // ...additional article fields
          "nlp": {
            "theme": "Business, Energy",
            "summary": "Article summary text...",
            "sentiment": {
              "title": 0.12,
              "content": 0.34
            }
            // qwen_embedding omitted (1024-float array); see Work with embeddings directly
          }
        }
        // ...additional articles in this cluster
      ]
    }
    // ...additional clusters
  ]
}
```

## Work with embeddings directly

The same Qwen3 embeddings used for clustering are also available in the API response for you to use in your own pipelines. This is useful when the built-in clustering does not match your use case, or when you need to work with more articles than fit in a single request.
Common use cases include:

* Semantic search over your own article corpus.
* Recommendation systems.
* Deduplication with custom similarity thresholds.
* Topic visualization.
* Domain-specific clustering algorithms.

Embeddings output requires the `v3_nlp_embeddings` subscription plan. Qwen3 embeddings are only available for articles indexed from January 1, 2026 onward.

To request embeddings in the response, set `include_nlp_data=True`. Each article's embedding is returned in `article.nlp.qwen_embedding` as an array of 1024 floats.

```python fetch_embeddings.py
import os
import datetime

import numpy as np
from newscatcher import NewscatcherApi

client = NewscatcherApi(api_key=os.environ["NEWSCATCHER_API_KEY"])

response = client.search.post(
    q="artificial intelligence",
    lang=["en"],
    from_=datetime.datetime.fromisoformat("2026-01-01 00:00:00+00:00"),
    page_size=100,
    include_nlp_data=True,
    has_nlp=True,
    embeddings_output="qwen_embedding",
)

# Keep only articles that carry an embedding, so that row i of the matrix
# always corresponds to articles[i] in the examples that follow
articles = [
    art
    for art in (response.articles or [])
    if art.nlp and getattr(art.nlp, "qwen_embedding", None) is not None
]

# Build a matrix of embeddings, one row per article
embeddings = np.array(
    [getattr(art.nlp, "qwen_embedding") for art in articles],
    dtype=np.float32,
)

print(f"Retrieved {len(articles)} articles with embeddings")
print(f"Embedding matrix shape: {embeddings.shape}")  # (n_articles, 1024)
```

Once you have the embedding matrix, you can use it in your own pipelines. The following examples use `scikit-learn` and `matplotlib`, which are not included in `newscatcher-sdk`. Install them separately:

```bash
pip install scikit-learn matplotlib
```

Find articles most similar to a reference article without making additional API calls:

```python semantic_similarity.py
from sklearn.metrics.pairwise import cosine_similarity

# Similarity of every article against the first one
scores = cosine_similarity(embeddings[0:1], embeddings)[0]
ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

print("Most similar articles:")
for idx, score in ranked[1:6]:  # skip index 0, the reference article itself
    print(f"  {score:.3f}  {articles[idx].title}")
```

Apply any clustering algorithm that accepts dense vectors:

```python custom_clustering.py
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=10, metric="cosine", linkage="average")
labels = model.fit_predict(embeddings)

for article, label in zip(articles, labels):
    print(f"Cluster {label}: {article.title}")
```

Project embeddings into 2D to explore topic structure:

```python visualize.py
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], alpha=0.6)
for i, art in enumerate(articles):
    plt.annotate(art.title[:40], coords[i], fontsize=6)
plt.show()
```

## Clustering vs. deduplication

Clustering and deduplication both help organize large sets of articles, but they serve different purposes:

|                             | Clustering                            | Deduplication                            |
| --------------------------- | ------------------------------------- | ---------------------------------------- |
| **Purpose**                 | Groups related articles               | Removes near-identical articles          |
| **Output**                  | Groups of related articles            | A set of unique articles                 |
| **Similarity threshold**    | Lower (allows broader groupings)      | Higher (identifies near-exact matches)   |
| **Effect on article count** | Retains all articles                  | Removes duplicates                       |
| **Best for**                | Trend analysis, multi-source coverage | Removing redundancy, ensuring uniqueness |

For more information, see [Articles deduplication](/news-api/guides-and-concepts/articles-deduplication).
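
The contrast in the table can be illustrated with a toy example. The sketch below is only an illustration: the 2-dimensional vectors are made up, the `0.95` deduplication threshold is an arbitrary assumption, and the greedy `deduplicate` and `group` helpers are hypothetical stand-ins rather than the methods the API uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def deduplicate(vectors, threshold=0.95):
    """Return indices of vectors that are not near-copies of an earlier kept one."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[k]) <= threshold for k in kept):
            kept.append(i)
    return kept


def group(vectors, threshold=0.7):
    """Greedily group vectors around the first member of each group; drops nothing."""
    groups = []
    for i, vec in enumerate(vectors):
        for members in groups:
            if cosine(vec, vectors[members[0]]) > threshold:
                members.append(i)
                break
        else:
            groups.append([i])
    return groups


toy = [
    [1.0, 0.0],    # original article
    [0.99, 0.05],  # near-identical copy of the original
    [0.8, 0.6],    # related story, different angle
    [0.0, 1.0],    # unrelated story
]

print(deduplicate(toy))  # → [0, 2, 3]  (the near-copy is removed)
print(group(toy))        # → [[0, 1, 2], [3]]  (all four articles retained)
```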