Elasticsearch duplicate document check

The basic query to count duplicates is shown below. It will not return complete information about the duplicates, however, because a terms aggregation returns only 10 buckets by default.

GET /index/type/_search
{
  "size": 0,
  "aggs": {
    "db": {
      "terms": { "field": "source-dbtype" },
      "aggs": {
        "count": {
          "terms": { "field": "column_name", "min_doc_count": 2 }
        }
      }
    }
  }
}
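To make the shape of the result concrete, here is a minimal Python sketch that flattens the response of the nested terms query into a plain dictionary. The helper name duplicates_by_db and the sample response are assumptions made for illustration; the sample data is hand-written, not real output, and sending the request is left to whatever client you use.

```python
def duplicates_by_db(response):
    """Map each source-dbtype key to {column_name value: doc_count}."""
    result = {}
    # "db" and "count" are the aggregation names used in the query above.
    for db_bucket in response["aggregations"]["db"]["buckets"]:
        inner = db_bucket["count"]["buckets"]
        result[db_bucket["key"]] = {b["key"]: b["doc_count"] for b in inner}
    return result


# Hand-written sample response (illustrative data, not real output).
sample_response = {
    "aggregations": {
        "db": {
            "buckets": [
                {
                    "key": "mysql",
                    "doc_count": 5,
                    "count": {
                        "buckets": [
                            {"key": "user_1", "doc_count": 3},
                            {"key": "user_2", "doc_count": 2},
                        ]
                    },
                }
            ]
        }
    }
}

print(duplicates_by_db(sample_response))
```

Keep in mind that each bucket list here holds at most 10 entries unless a larger size is requested, which is the limitation described above.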

To get the complete details, we first need the number of distinct values in the field, which the cardinality aggregation provides, as shown below.

Cardinality Aggregation

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.

GET index/type/_search
{
  "size": 0,
  "aggs": {
    "maximum_match_counts": {
      "cardinality": {
        "field": "column_name",
        "precision_threshold": 100
      }
    }
  }
}

Read the value of the maximum_match_counts aggregation from the response.
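As a rough sketch of this step in Python, the request body can be built as a dictionary and the count read from the "value" key of the aggregation result. The function names cardinality_query and distinct_count are hypothetical, and the sample response is hand-written for illustration only.

```python
def cardinality_query(field, precision_threshold=100):
    """Build the body of the cardinality request shown above."""
    return {
        "size": 0,
        "aggs": {
            "maximum_match_counts": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


def distinct_count(response):
    """Read the (approximate) number of distinct values from the response."""
    return int(response["aggregations"]["maximum_match_counts"]["value"])


# Hand-written sample response (illustrative data, not real output).
sample_response = {"aggregations": {"maximum_match_counts": {"value": 1234}}}

body = cardinality_query("column_name")
print(body["aggs"]["maximum_match_counts"]["cardinality"]["field"])
print(distinct_count(sample_response))
```

Since the count is approximate, it may be worth adding a small margin before using it as a bucket size.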

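Now you can get all duplicate values of column_name. In the query below, maximum_match_counts is a placeholder: replace it with the number returned by the cardinality query.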

GET index/type/_search
{
  "size": 0,
  "aggs": {
    "column_name": {
      "terms": {
        "field": "column_name",
        "size": maximum_match_counts,
        "min_doc_count": 2
      }
    }
  }
}

This returns every value of column_name that occurs in two or more documents in your index, together with its document count.
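The final step can be sketched in Python as follows: build the terms request with the distinct count as its size, then collect the duplicated values from the buckets. The helper names duplicates_query and duplicate_values are assumptions, the sample response is hand-written, and the size is clamped to at least 1 on the assumption that the terms aggregation rejects a size of 0.

```python
def duplicates_query(field, distinct_count):
    """Build the terms request, sized to cover every distinct value."""
    size = max(1, distinct_count)  # assumption: size must be positive
    return {
        "size": 0,
        "aggs": {
            field: {
                "terms": {
                    "field": field,
                    "size": size,
                    "min_doc_count": 2,  # keep only values seen at least twice
                }
            }
        },
    }


def duplicate_values(response, agg_name):
    """Map each duplicated value to the number of documents containing it."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


# Hand-written sample response (illustrative data, not real output).
sample_response = {
    "aggregations": {
        "column_name": {
            "buckets": [
                {"key": "user_1", "doc_count": 3},
                {"key": "user_2", "doc_count": 2},
            ]
        }
    }
}

query = duplicates_query("column_name", 1234)
print(query["aggs"]["column_name"]["terms"]["size"])
print(duplicate_values(sample_response, "column_name"))
```

For fields with a very large number of distinct values, a single request of this size can be heavy on the cluster, so treat this as a starting point rather than a definitive approach.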

Hope this helps :)

Author: Adil Mohammed

Algae Education Services
