Histogram aggregation in Elasticsearch

Introduction

A histogram represents numerical data by dividing its range into contiguous groups (buckets) and counting how many values fall into each one. It is used to show the distribution of the data. For example, the height of subjects in a population is a continuous variable (meaning it can take any value within its range), and it can be grouped into ranges such as: less than 5 feet, 5 feet to 5 feet 3 inches, and so on. The number of instances in each group, taken together, forms the histogram of the distribution.

The histogram aggregation

Let us look at how to generate a histogram using Elasticsearch. An example histogram request looks like this:

{
  "aggs": {
    "histogram": {
      "histogram": {
        "field": "Value",
        "interval": 200
      }
    }
  }
}
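
To make the bucketing rule concrete, here is a minimal Python sketch of how a value maps to a bucket key. It follows the rounding rule documented for the histogram aggregation (round the value down to the nearest multiple of the interval, with an offset that defaults to 0). The function name bucket_key is our own and is not part of any Elasticsearch client.

```python
import math

def bucket_key(value: float, interval: float, offset: float = 0.0) -> float:
    """Return the key (lower bound) of the bucket that `value` falls into."""
    return math.floor((value - offset) / interval) * interval + offset

# Values taken from the sample documents shown later in this article.
print(bucket_key(2829.0, 200))  # 2800.0
print(bucket_key(2720.0, 200))  # 2600.0
```

Each bucket therefore covers the half-open range from its key up to, but not including, key + interval.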

Using the stats aggregation to determine the interval

As shown above, the histogram aggregation requires an interval parameter. It sets the width of each bucket and therefore, for a given range of values, how many buckets the histogram contains.

To determine a suitable interval, we need to know the minimum and maximum values of the field. These can be determined using the stats aggregation as follows:

{
  "aggs": {
    "stats": {
      "stats": {
        "field": "Value"
      }
    }
  },
  "size": 0
}

The stats aggregation returns the following for the field Value:

"aggregations": {
  "stats": {
    "count": 209,
    "min": 1879.0,
    "max": 3768.0,
    "avg": 2862.8803827751194,
    "sum": 598342.0
  }
}

The values span 3768 - 1879 = 1889. Assuming we want about 10 buckets in the histogram, each bucket should cover roughly 189, which we round up to 200 for the interval parameter.
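
The choice of 200 can be reproduced with a small calculation. The sketch below is one possible heuristic, not something Elasticsearch does for you: it divides the range by the desired number of buckets and rounds up to the next "nice" step (1, 2, 5 or 10 times a power of ten). The function name nice_interval and the rounding scheme are assumptions made for illustration.

```python
import math

def nice_interval(min_value: float, max_value: float, target_buckets: int) -> float:
    """Pick a 1/2/5/10 x power-of-ten interval yielding roughly target_buckets buckets."""
    raw = (max_value - min_value) / target_buckets   # exact width per bucket
    magnitude = 10 ** math.floor(math.log10(raw))    # power of ten at or below raw
    for step in (1, 2, 5, 10):
        if step * magnitude >= raw:
            return float(step * magnitude)
    return float(10 * magnitude)  # fallback for floating-point edge cases

# min and max come from the stats response above.
print(nice_interval(1879.0, 3768.0, 10))  # 200.0
```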

A histogram example

Let us now look at an example. We use FAO data on the calories consumed per person per day in each country. The following query extracts the raw data for the year 2013:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Year": "2013"
          }
        },
        {
          "term": {
            "Element.keyword": "Food supply (kcal/capita/day)"
          }
        },
        {
          "term": {
            "Item.keyword": "Grand Total"
          }
        }
      ]
    }
  }
}

This returns data as follows:

"hits": [
  {
    "_index": "foodsupply_crops_e_all_data_(normalized).csv",
    "_type": "doc",
    "_id": "491193",
    "_score": 7.691229,
    "_source": {
      "Area Code": 27,
      "Area": "Bulgaria",
      "Item Code": 2901,
      "Item": "Grand Total",
      "Element Code": 664,
      "Element": "Food supply (kcal/capita/day)",
      "Year Code": 2013,
      "Year": 2013,
      "Unit": "kcal/capita/day",
      "Value": 2829.0,
      "Flag": "Fc"
    }
  },
  {
    "_index": "foodsupply_crops_e_all_data_(normalized).csv",
    "_type": "doc",
    "_id": "514831",
    "_score": 7.691229,
    "_source": {
      "Area Code": 233,
      "Area": "Burkina Faso",
      "Item Code": 2901,
      "Item": "Grand Total",
      "Element Code": 664,
      "Element": "Food supply (kcal/capita/day)",
      "Year Code": 2013,
      "Year": 2013,
      "Unit": "kcal/capita/day",
      "Value": 2720.0,
      "Flag": "Fc"
    }
  }
...

Combining this query with the histogram aggregation as shown above, we have the following request:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Year": "2013"
          }
        },
        {
          "term": {
            "Element.keyword": "Food supply (kcal/capita/day)"
          }
        },
        {
          "term": {
            "Item.keyword": "Grand Total"
          }
        }
      ]
    }
  },
  "aggs": {
    "histo": {
      "histogram": {
        "field": "Value",
        "interval": 200
      }
    }
  },
  "size": 0
}
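
If the request is assembled in application code rather than written by hand, the same body can be built as a plain dictionary. The following is a minimal sketch using only the Python standard library; the helper name build_histogram_request and its parameters are hypothetical, and no Elasticsearch client is assumed.

```python
import json

def build_histogram_request(year: str, element: str, item: str,
                            field: str, interval: float) -> dict:
    """Combine the three term filters with a histogram aggregation."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"Year": year}},
                    {"term": {"Element.keyword": element}},
                    {"term": {"Item.keyword": item}},
                ]
            }
        },
        "aggs": {"histo": {"histogram": {"field": field, "interval": interval}}},
        "size": 0,  # return aggregation results only, no hits
    }

body = build_histogram_request(
    "2013", "Food supply (kcal/capita/day)", "Grand Total", "Value", 200
)
print(json.dumps(body, indent=2))
```

The printed JSON can then be sent to the _search endpoint of the index; the host and index name depend on your setup.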

Setting size to 0 suppresses the matching documents (hits) in the response, so that only the aggregation results are returned. Executing this request results in the following response:

"aggregations": {
  "histo": {
    "buckets": [
      {
        "key": 1800.0,
        "doc_count": 2
      },
      {
        "key": 2000.0,
        "doc_count": 11
      },
      {
        "key": 2200.0,
        "doc_count": 18
      },
      {
        "key": 2400.0,
        "doc_count": 29
      },
      {
        "key": 2600.0,
        "doc_count": 37
      },
      {
        "key": 2800.0,
        "doc_count": 30
      },
      {
        "key": 3000.0,
        "doc_count": 28
      },
      {
        "key": 3200.0,
        "doc_count": 29
      },
      {
        "key": 3400.0,
        "doc_count": 18
      },
      {
        "key": 3600.0,
        "doc_count": 7
      }
    ]
  }
}
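
As a quick sanity check on this response, the sketch below turns each bucket key into its range, prints a rough text histogram, and totals the counts. The bucket values are copied from the response above; the variable names are our own. The total of 209 matches the count reported by the stats aggregation earlier.

```python
# Bucket keys and counts copied from the response above.
buckets = [
    {"key": 1800.0, "doc_count": 2},
    {"key": 2000.0, "doc_count": 11},
    {"key": 2200.0, "doc_count": 18},
    {"key": 2400.0, "doc_count": 29},
    {"key": 2600.0, "doc_count": 37},
    {"key": 2800.0, "doc_count": 30},
    {"key": 3000.0, "doc_count": 28},
    {"key": 3200.0, "doc_count": 29},
    {"key": 3400.0, "doc_count": 18},
    {"key": 3600.0, "doc_count": 7},
]
INTERVAL = 200  # same interval as in the request

total = sum(b["doc_count"] for b in buckets)
modal = max(buckets, key=lambda b: b["doc_count"])

# Each bucket covers [key, key + INTERVAL).
for b in buckets:
    low, high = b["key"], b["key"] + INTERVAL
    print(f"[{low:.0f}, {high:.0f}): {'#' * b['doc_count']} {b['doc_count']}")

print("total documents:", total)                      # 209
print("most common range starts at:", modal["key"])   # 2600.0
```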

Source: https://www.getargon.io/docs/articles/aggregation/histogram.html
