Want to Get Rid of Documents with Duplicate Content?
Whether you’re combining data from two different data sources, have multiple purchases from the same customer, or just entered the same data in a web form twice, it seems like everyone faces the problem of duplicate data at one point or another.
In this blog post, we'll look at using views in Couchbase Server 2.0 to find matching fields among documents and retain only the non-duplicate documents. For the sake of this example, assume each document has three common user-specified fields: first_name, last_name, and postal_code. Using the Ruby client for Couchbase Server and the faker Ruby gem, you can build a simple data generator to load some sample duplicate data into Couchbase. To use Ruby with Couchbase, you should download the Ruby SDK here.
Here is an execution sample:
$ ruby ./generate.rb --help
Usage: generate.rb [options]
    -h, --hostname HOSTNAME          Hostname to connect to (default: 127.0.0.1:8091)
    -u, --user USERNAME              Username to log with (default: none)
    -p, --passwd PASSWORD            Password to log with (default: none)
    -b, --bucket NAME                Name of the bucket to connect to (default: default)
    -t, --total-records NUM          The total number of the records to generate (default: 10000)
    -d, --duplicate-rate NUM         Each NUM-th record will be duplicate (default: 30)
    -?, --help                       Show this message

$ ruby ./generate.rb -t 1000 -d 5
1000 / 1000
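The generator script itself is not reproduced here. As a rough illustration of what the -t and -d options control, here is a minimal in-memory sketch in plain Ruby. It does not use faker or talk to Couchbase, and the names FIRST_NAMES, LAST_NAMES, and generate_records are hypothetical, chosen only for this example.

```ruby
# In-memory sketch of a duplicate-producing generator.
# This is NOT the original generate.rb: it uses small hard-coded name
# lists instead of faker and returns records instead of storing them.
FIRST_NAMES = %w[Alice Bob Carol Dave Erin].freeze
LAST_NAMES  = %w[Smith Jones Brown Miller Davis].freeze

# Builds `total` records. Every `duplicate_rate`-th record repeats the
# fields of the record generated just before it, under a new id.
def generate_records(total, duplicate_rate, rng = Random.new(42))
  records = []
  (1..total).each do |n|
    if (n % duplicate_rate).zero? && !records.empty?
      fields = records.last[:doc].dup
    else
      fields = {
        'first_name'  => FIRST_NAMES.sample(random: rng),
        'last_name'   => LAST_NAMES.sample(random: rng),
        'postal_code' => format('%05d', rng.rand(100_000))
      }
    end
    records << { id: "id#{n}", doc: fields }
  end
  records
end

records = generate_records(1000, 5)
puts "#{records.size} / 1000"
```

With a duplicate rate of 5, records 5, 10, 15, and so on carry the same three fields as the record before them but a different id, which is the situation the view is meant to detect.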
Step 1
function (doc, meta) {
  emit([doc.first_name + '-' + doc.last_name + '-' + doc.postal_code], meta.id);
}
Step 2
The reduce function looks like this:
function (keys, values, rereduce) {
  if (rereduce) {
    var res = [];
    for (var i = 0; i < values.length; i++) {
      res = res.concat(values[i]);
    }
    return res;
  } else {
    return values;
  }
}
After grouping, if more than one meta.id value shares the same key, we concatenate them into a list of meta.ids that refer to duplicate documents.
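To see how the map and reduce functions cooperate, it can help to emulate them locally. The following is a minimal sketch in plain Ruby, not something Couchbase runs: map_doc, reduce_ids, and grouped_view are hypothetical names, the two-item batching only imitates the way a server may reduce partial results before a rereduce, and the sample names and postal codes are made up.

```ruby
# Ruby stand-in for the map function: one [key, value] pair per document.
def map_doc(id, doc)
  key = [doc['first_name'], doc['last_name'], doc['postal_code']].join('-')
  [[key], id]
end

# Ruby stand-in for the reduce function, including the rereduce branch.
# On rereduce, `values` is a list of earlier reduce results (lists of ids),
# so it is flattened by one level, like the concat loop in JavaScript.
def reduce_ids(values, rereduce)
  rereduce ? values.flatten(1) : values
end

# Emulates querying the view with grouping enabled: one row per distinct key.
def grouped_view(docs)
  emitted = docs.map { |id, doc| map_doc(id, doc) }
  emitted.group_by { |key, _id| key }.map do |key, pairs|
    ids = pairs.map { |_key, id| id }
    # Reduce in batches of two, then rereduce the partial results.
    partials = ids.each_slice(2).map { |batch| reduce_ids(batch, false) }
    { key: key, value: reduce_ids(partials, true) }
  end
end

docs = {
  'id18' => { 'first_name' => 'Ann', 'last_name' => 'Lee', 'postal_code' => '94040' },
  'id19' => { 'first_name' => 'Bob', 'last_name' => 'Ray', 'postal_code' => '10001' },
  'id20' => { 'first_name' => 'Bob', 'last_name' => 'Ray', 'postal_code' => '10001' }
}

grouped_view(docs).each do |row|
  puts "#{row[:key].inspect} => #{row[:value].inspect}"
end
```

Any row whose value holds more than one id points at a set of documents that share the same three fields.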
Step 3
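require 'couchbase'

# `options` is assumed to hold the parsed command-line settings
# (connection details plus :design_document and :view).
connection = Couchbase.connect(options)
ddoc = connection.design_docs[options[:design_document]]
view = ddoc.send(options[:view])

connection.run do
  # :group => true yields one row per key, with the list of ids as its value
  view.each(:group => true) do |doc|
    dup_num = doc.value.size
    if dup_num > 1
      puts "left doc #{doc.value[0]}, "
      # delete documents from second to last
      connection.delete(doc.value[1..-1])
      # the first id is kept, so dup_num - 1 documents were removed
      puts "removed #{dup_num - 1} duplicate(s)"
    end
  end
end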
If the value array contains more than one meta.id, there are duplicate documents corresponding to that key. As shown in the figure above, id19 and id20 are duplicate documents.
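Because deletion is irreversible, it can be worth computing the keep and remove lists first and inspecting them before touching the bucket. Here is a minimal dry-run sketch in plain Ruby. It assumes each grouped row has been converted to a hash with a :value list of ids, and plan_cleanup is a hypothetical helper, not part of the Couchbase client.

```ruby
# Splits grouped view rows into ids to keep and ids to remove.
# Each row is assumed to look like { key: [...], value: [id, id, ...] }.
# The first id of every row is kept; the rest are treated as duplicates.
def plan_cleanup(rows)
  keep = []
  remove = []
  rows.each do |row|
    ids = row[:value]
    next if ids.empty?
    keep << ids.first
    remove.concat(ids.drop(1))
  end
  { keep: keep, remove: remove }
end

rows = [
  { key: ['Ann-Lee-94040'], value: ['id18'] },
  { key: ['Bob-Ray-10001'], value: %w[id19 id20] }
]

plan = plan_cleanup(rows)
puts "keep:   #{plan[:keep].join(', ')}"
puts "remove: #{plan[:remove].join(', ')}"
```

Only the ids in the remove list would then be passed to the delete call, and only rows with more than one id contribute to it.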


Published at DZone with permission of Baxter Denney. See the original article here.