Search Engines October 26, 2013

Grouping in ElasticSearch
using child documents

How to answer questions such as "average number of orders per customer during November?" that would be easy using GROUP BY in a relational database isn't always obvious when using ElasticSearch. Here's one solution, using child documents.

This week I ran into a problem with a seemingly simple use case with ElasticSearch (v. 0.90.5). I had data such as this:

Customer ID Date
john 2013-11-15T12:00:00 
jane 2013-11-20T12:00:00
john 2013-12-01T12:00:00

What this data represents isn't very relevant. Depending on the naming of the columns shown above and other data that I've omitted here it could be comments, Facebook likes or just about any other type of event. However, for the sake of argument let's say that it represent orders.

I wanted to index the data, and a lot more of it, into an ElasticSearch index to enable me to answer questions such as:

  • How many orders have been placed?
  • How many orders have been placed during november?
  • How many orders have been placed during the current month compared to the previous month?
  • What is the average number of orders per customer?
  • What was the average number of orders per customer during November?

My initial instinct was to simply index each order as a document. Once I had done so I found that it was easy to answer the first three questions. However, the two last questions, about average number of orders per customers, wasn't so easy.

In fact, it was next to impossible. The only solution that I found (thanks to Henrik) was to search for orders and retrieve a terms facet for Customer ID. Using that I could divide the total number of search hits by the number of terms returned in the facet. However, ElasticSearch doesn't return the total number of unique terms with a terms facet.

So, after having retrieved the search result I would have to look at how many terms was returned in the array myself. While that would work in a scenario where there was only a small number of customers ElasticSearch wouldn't be very happy returning a huge number of terms in the facet. Not to mention that the response body would be huge.

My next idea was to make two indexing requests for each order, one indexing the order and one indexing, or rather upserting, a document representing the customer. By keeping track of the number of orders each customer had done I would be able to answer the fourth question; average number of orders per customer. However, that wouldn't enable me to answer the fifth question; average number of orders per customer during a specific timeframe.

Child documents to the rescue

After having banged my head against the wall for a while, Googled and read up on the state of field collapsing in ElasticSearch I finally found a workable solution; indexing orders and customers separately and mapping orders as child documents to customers. While this does require me to index multiple times it does allow me to answer questions about average number of orders per customer with a single search request.

To start with I created an index with a parent mapping for orders:

curl -XPUT "http://localhost:9200/orders" -d'
{
    "mappings": {
        "customer": {},
        "order" : {
            "_parent" : {
                "type" : "customer"
            }
        }
    }
}'

Inserting the three orders from the table above can be achieved like this:

curl -XPOST "http://localhost:9200/orders/_bulk" -d'
{ "index" : { "_type" : "customer", "_id" : "john" } }
{ "name" : "John Doe" }
{ "index" : { "_type" : "order", "_parent" : "john" } }
{ "date" : "2013-11-15T12:00:00" }
'

curl -XPOST "http://localhost:9200/orders/_bulk" -d'
{ "index" : { "_type" : "customer", "_id" : "jane" } }
{ "name" : "Jane Doe" }
{ "index" : { "_type" : "order", "_parent" : "jane" } }
{ "date" : "2013-11-20T12:00:00" }
'

curl -XPOST "http://localhost:9200/orders/_bulk" -d'
{ "index" : { "_type" : "customer", "_id" : "john" } }
{ "name" : "John Doe" }
{ "index" : { "_type" : "order", "_parent" : "john" } }
{ "date" : "2013-12-01T12:00:00" }
'

There are two things to note in the requests above. First of all note that each order is indexed with a _parent. Second, I index the customer several times. That could of course be avoided if I could be sure that the customer already existed.

With the mappings done and the data indexed answering the question "What is the average number of orders per customer?" can be done by:

  • Searching across all types.
  • Adding a filter excluding all but orders. Note that this should be done using the filter part of the search request rather than in the query part.
  • Adding a filter facet to the request body. The filter facet filters out all documents except customers and requires each customer to have a child document of type order.

Here's how such a request looks:

curl -XPOST "http://localhost:9200/orders/_search" -d'
{
    "filter": {
        "type": {
            "value": "order"
        }
    },
    "facets": {
        "customer_count": {
            "filter": {
                "and": [{
                  "type": {
                      "value": "customer"
                  }  
                },
                {
                    "has_child" : {
                        "type" : "order",
                        "query" : {
                            "match_all" : {}
                        }
                    }
                }]
            }
        }
    },
    "size": 0
}'

The response from ElasticSearch looks like this:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1,
      "hits": []
   },
   "facets": {
      "customer_count": {
         "_type": "filter",
         "count": 2
      }
   }
}

By dividing the total number of hits with the count returned in the filter facet we get the average number of orders per customer, the answer to the fourth question. To answer the fifth question, average number of orders per customer during a specific time frame we can modify the filter part of the request body, filtering not only by type but also by time frame. By doing so we get the total number of orders during the time frame. However, we also need to limit the number of customers so we also modify the filter facet to use a filtered query with the same filter. Like this:

curl -XPOST "http://localhost:9200/orders/_search" -d'
{
    "filter": {
        "and": [{
            "type": {
                "value": "order"
            }
        },
        {
            "range": {
                "date": {
                    "gte": "2013-11-01T00:00:00",
                    "lt": "2013-12-01T00:00:00"
                }
            }
        }]
    },
    "facets": {
        "customer_count": {
            "filter": {
                "and": [{
                  "type": {
                      "value": "customer"
                  }  
                },
                {
                    "has_child" : {
                        "type" : "order",
                        "query" : {
                            "filtered": {
                                "query": {
                                    "match_all": {}
                                },
                                "filter": {
                                    "range": {
                                        "date": {
                                            "gte": "2013-11-01T00:00:00",
                                            "lt": "2013-12-01T00:00:00"
                                        }
                                    }
                                }
                            }
                        }
                    }
                }]
            }
        }
    },
    "size": 0 }'

Conclusion

Using child documents and indexing pretty much the same data twice as described above solves my use cases. There may be other ways of accomplishing the same result but I've yet to find one and this seems to work well so far. However, if you know of a better/more efficient way of accomplishing the same goals, please let me know!

PS. For updates about new posts, sites I find useful and the occasional rant you can follow me on Twitter. You are also most welcome to subscribe to the RSS-feed.

Joel Abrahamsson

Joel Abrahamsson

I'm a passionate web developer and systems architect living in Stockholm, Sweden. I work as CTO for a large media site and enjoy developing with all technologies, especially .NET, Node.js, and ElasticSearch. Read more

Comments

comments powered by Disqus

More about Search Engines