Finding Unique Values in Elasticsearch

Introduction

This document explains how to perform an elasticsearch query for the unique values of a field along with the counts. This is the equivalent of doing the following in SQL: the example below counts the unique values for the field title in the table titles.

SELECT title, count(*)
FROM   Titles
GROUP BY title

Which results in the unique values and counts in the titles table.

+--------------------+----------+
| title              | count(*) |
+--------------------+----------+
| Assistant Engineer |    15128 |
| Engineer           |   115003 |
| Manager            |       24 |
| Senior Engineer    |    97750 |
| Senior Staff       |    92853 |
| Staff              |   107391 |
| Technique Leader   |    15159 |
+--------------------+----------+

Let us see how we can accomplish the same thing in Elasticsearch.

Using a terms aggregation

The first thing we attempt is the term aggregation. This aggregation is used to find the top 10 unique values in a field.

Here is what the query looks like. We are finding the unique values for the field names Area. The .keyword tells elastic search to aggregate this field as a keyword and not a full text search.

{
  "aggs": {
    "keys": {
      "terms": {
        "field": "Area.keyword"
      }
    }
  },
  "size": 0
}

(The parameter setting size = 0 tells elastic search we do not want a listing of the hits).

Here is the output showing the top 10 unique values and the counts.

...
"buckets" : [
  {
    "key" : "World",
    "doc_count" : 3738
  },
  {
    "key" : "Americas",
    "doc_count" : 3444
  },
  {
    "key" : "Annex I countries",
    "doc_count" : 3444
  },
  {
    "key" : "Asia",
    "doc_count" : 3444
  },
  {
    "key" : "Australia",
    "doc_count" : 3444
  },
  {
    "key" : "Australia & New Zealand",
    "doc_count" : 3444
  },
  {
    "key" : "China",
    "doc_count" : 3444
  },
  {
    "key" : "China, mainland",
    "doc_count" : 3444
  },
  {
    "key" : "Eastern Asia",
    "doc_count" : 3444
  },
  {
    "key" : "India",
    "doc_count" : 3444
  }
]
...

The terms aggregation accepts a size parameter for specifying how many top unique values you want (default is 10). For example, you can ask for 100 values like this:

{
  "aggs": {
    "keys": {
      "terms": {
        "field": "Area.keyword",
        "size": 100
      }
    }
  },
  "size": 0
}

And you end up with a maximum of 100 unique values.

But, what if you want a listing of all the unique values? Should you specify a large number for the size, say 10000?

Using a composite aggregation

Actually, that is the wrong approach for the following reasons.

  • There may be more than 10000 values in the field. To get the others you will have to specify a larger number such as 50000 and try again, and perhaps again.
  • All the values will be returned in the response which means that much amount of memory will be required to parse and process them. Also considering that you may be required to do this again and again, your program might be sluggish.

The solution recommended by elasticsearch for this situation is to use a composite aggregation. Advantages of using a composite aggregation:

  • Allows you to paginate and scroll through all the unique values.
  • You will not need to know how many unique values are present before hand.

The composite aggregation is executed in two parts: the first part starts the aggregation and the second and subsequent parts continue where the previous one left off. This is accomplished by setting the after parameter to the last unique value in the previous batch.

For the sake of illustration, we set the number of unique values in one batch to 5. Here is the first part of the composite aggregation with no after parameter.

{
  "from": 0, 
  "aggs": {
    "keys": {
      "composite": {
        "sources": [
          {
            "Area.keyword": {
              "terms": {
                "field": "Area.keyword"
              }
            }
          }
        ], 
        "size": 5
      }
    }
  }, 
  "size": 0
}

The response for the search results in 5 unique values as requested.

"buckets" : [
  {
    "key" : {
      "Area.keyword" : "Afghanistan"
    },
    "doc_count" : 3017
  },
  {
    "key" : {
      "Area.keyword" : "Africa"
    },
    "doc_count" : 3381
  },
  {
    "key" : {
      "Area.keyword" : "Albania"
    },
    "doc_count" : 3381
  },
  {
    "key" : {
      "Area.keyword" : "Algeria"
    },
    "doc_count" : 3377
  },
  {
    "key" : {
      "Area.keyword" : "American Samoa"
    },
    "doc_count" : 2743
  }
]

As you can see above, the last unique value is American Samoa. We pass this value as the after parameter to get the next batch of results.

{
  "from": 0, 
  "aggs": {
    "keys": {
      "composite": {
        "sources": [
          {
            "Area.keyword": {
              "terms": {
                "field": "Area.keyword"
              }
            }
          }
        ], 
        "after": {
          "Area.keyword": "American Samoa"
        }, 
        "size": 5
      }
    }
  }, 
  "size": 0
}

The next batch of results are shown below.

"buckets" : [
  {
    "key" : {
      "Area.keyword" : "Americas"
    },
    "doc_count" : 3444
  },
  {
    "key" : {
      "Area.keyword" : "Angola"
    },
    "doc_count" : 3318
  },
  {
    "key" : {
      "Area.keyword" : "Anguilla"
    },
    "doc_count" : 2438
  },
  {
    "key" : {
      "Area.keyword" : "Annex I countries"
    },
    "doc_count" : 3444
  },
  {
    "key" : {
      "Area.keyword" : "Antigua and Barbuda"
    },
    "doc_count" : 3298
  }
]

For the next batch of results, we use Antigua and Barbuda as the after value. And so on till all the results are exhausted.

Source: https://www.getargon.io/docs/articles/elasticsearch/unique-values.html

Bạn nghĩ gì về bài viết này?