Over a million developers have joined DZone.

Avoid where in a reduce clause

DZone's Guide to

Avoid where in a reduce clause

· Big Data Zone
Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

We got a customer question about a map/reduce index that produced the wrong results. The problem was a problem between the conceptual model and the actual model of how Map/Reduce actually works.

Let us take the following silly example. We want to find all the animal owner’s that have more than a single animal. We define an index like so:

// map
from a in docs.Animals
select new { a.Owner, Names = new[]{a.Name} }

// reduce
from r in results
group r by r.Owner into g
where g.Sum(x=>x.Names.Length) > 1
select new { Owner = g.Key, Names = g.SelectMany(x=>x.Names) }

And here is our input:

{ "Owner": "users/1", "Name": "Arava" }    // animals/1
{ "Owner": "users/1", "Name": "Oscar" }    // animals/2
{ "Owner": "users/1", "Name": "Phoebe" }   // animals/3

What would be the output of this index?

At first glance, you might guess that it would be:

{ "Owner": "users/1", "Names": ["Arava", "Oscar", "Phoebe" ] }

But you would be wrong. The actual output of this index… It is nothing. This index actually have no output.

But why?

To answer that, let us ask the following question. What would be the output for the following input?

{ "Owner": "users/1", "Name": "Arava" } // animals/1

That would be nothing, because it would be filtered by the where in the reduce clause. This is the underlying reasoning why this index has no output.

If we feed it the input one document at a time, it has no output. It is only if we give it all the data upfront that it has any output. But that isn’t how Map/Reduce works with RavenDB. Map/Reduce is incremental and recursive. Which means that we can (and do) run it on individual documents or blocks of documents independently. In order to ensure that, we actually always run the reduce function on the output of each individual document’s map result.

That, in turn, means that the index above has no output.

To write this index properly, I would have to do this:

// map
from a in docs.Animals
select new { a.Owner, Names = new[]{a.Name}, Count = 1 }

// reduce
from r in results
group r by r.Owner into g
select new { Owner = g.Key, Names = g.SelectMany(x=>x.Names), Count = g.Sum(x=>x.Count) }

And do the filter of Count > 1 in the query itself.

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.


Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}