
FLOSS Moling with RavenDB


There is the FLOSS Mole data set, which provides a lot of interesting information about open source projects. As I am always interested in testing RavenDB with different data sets, I decided that this would be a great opportunity to do that, and to get some additional information about how things work along the way.

The data is provided in a number of formats, but most of them aren't really easy to access: SQL statements, and raw text files that I assume are tab separated, though I couldn't quickly figure that out for sure.

I decided that this would be a great example of actually migrating content from a SQL system to RavenDB. The first thing to do was to install MySQL, as that seems to be the easiest way to get the data out. (As a note, MySQL Workbench is really not what I would call nice.)

The data looks like this (these are the Google Code projects), and you can also see that a lot of the data is driven by the notion of a project:

[Image: sample rows from the Google Code projects tables]

I explored the data a bit, and I came to the conclusion that this is pretty simple stuff, overall. There are a few many-to-one associations, but all of them are capped (the max was 20 or so).

That meant, in turn, that the import process would be really simple. I started by creating the actual model that we will use to save to RavenDB:

[Image: the document model classes]
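
Since the image doesn't come through in this format, here is a minimal sketch of what the model looks like, reconstructed from the import code below. The class and property names are exactly the ones that code uses; everything else about the shape (auto-properties, plain lists) is an assumption:

    public class Project
    {
        public string Name { get; set; }
        public string CodeLicense { get; set; }
        public string CodeUrl { get; set; }
        public string ContentLicense { get; set; }
        public string ContentUrl { get; set; }
        public string Description { get; set; }
        public string Summary { get; set; }
        public List<string> Labels { get; set; }
        public List<Blog> Blogs { get; set; }
        public List<Group> Groups { get; set; }
        public List<Link> Links { get; set; }
        public List<Person> People { get; set; }
    }

    public class Blog
    {
        public string Link { get; set; }
        public string Title { get; set; }
    }

    public class Group
    {
        public string Name { get; set; }
        public string Url { get; set; }
    }

    public class Link
    {
        public string Url { get; set; }
        public string Title { get; set; }
    }

    public class Person
    {
        public string Name { get; set; }
        public string Role { get; set; }
        public string UserId { get; set; }
    }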

The rest was just a matter of reading from MySQL and writing to RavenDB. I chose to use PetaPoco for the SQL access, because it is the easiest. The following code sucks. It is written with the assumption that I know what the data sizes are, that the cost of making so many queries (roughly 1,500,000 of them) is acceptable, etc.

    // "RavenDB" and "MySQL" are connection string names resolved from the config file
    using (var docStore = new DocumentStore
        {
            ConnectionStringName = "RavenDB"
        }.Initialize())
    using (var db = new PetaPoco.Database("MySQL"))
    // BulkInsert batches the writes to the server instead of a round trip per document
    using (var bulk = docStore.BulkInsert())
    {
        foreach (var prj in db.Query<dynamic>("select * from gc_projects").ToList())
        {
            string name = prj.proj_name;
            // five more queries per project, one for each associated collection
            bulk.Store(new Project
                {
                    Name = name,
                    CodeLicense = prj.code_license,
                    CodeUrl = prj.code_url,
                    ContentLicense = prj.content_license,
                    ContentUrl = prj.content_url,
                    Description = prj.project_description,
                    Summary = prj.project_summary,
                    Labels = db.Query<string>("select label from gc_project_labels where proj_name = @0", name)
                                    .ToList(),
                    Blogs = db.Query<dynamic>("select * from gc_project_blogs where proj_name = @0", name)
                                .Select(x => new Blog { Link = x.blog_link, Title = x.blog_title })
                                .ToList(),
                    Groups = db.Query<dynamic>("select * from gc_project_groups where proj_name = @0", name)
                                .Select(x => new Group { Name = x.group_name, Url = x.group_url })
                                .ToList(),
                    Links = db.Query<dynamic>("select * from gc_project_links where proj_name = @0", name)
                                .Select(x => new Link { Url = x.link, Title = x.link_title })
                                .ToList(),
                    People = db.Query<dynamic>("select * from gc_project_people where proj_name = @0", name)
                        .Select(x => new Person
                            {
                                Name = x.person_name,
                                Role = x.role,
                                UserId = x.user_id
                            })
                        .ToList(),
                });
        }
    }

But it does the work, and it was simple to write. Using this code, I was able to insert 299,949 projects in just under 13 minutes. Most of the time went to making those 1.5 million queries to the database, by the way.
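
If the import time actually mattered, the obvious fix, given where the time goes, would be to prefetch each child table once and group its rows in memory by project name, turning roughly 1.5 million queries into six. Here is a rough sketch of that idea (this is not the original code, and it assumes the child tables fit comfortably in memory, which at a cap of ~20 rows per project they should):

    using (var docStore = new DocumentStore
        {
            ConnectionStringName = "RavenDB"
        }.Initialize())
    using (var db = new PetaPoco.Database("MySQL"))
    using (var bulk = docStore.BulkInsert())
    {
        // One query loads the whole child table; grouping by project name
        // replaces the per-project query in the loop below.
        var labelsByProject = db.Query<dynamic>("select proj_name, label from gc_project_labels")
            .ToList()
            .GroupBy(x => (string)x.proj_name)
            .ToDictionary(g => g.Key, g => g.Select(x => (string)x.label).ToList());
        // The blogs, groups, links, and people tables would be prefetched the same way.

        foreach (var prj in db.Query<dynamic>("select * from gc_projects").ToList())
        {
            string name = prj.proj_name;
            List<string> labels;
            bulk.Store(new Project
                {
                    Name = name,
                    Labels = labelsByProject.TryGetValue(name, out labels) ? labels : new List<string>(),
                    // the remaining properties would be mapped exactly as in the code above
                });
        }
    }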

Everything is cool, and it is quite nice. In the next post, I'll talk about why I wanted a new data set. Don't worry, it is going to be cool.




Published at DZone with permission of Ayende Rahien, DZone MVB.
