Using GitHub as a Data Lake
Yes, you read that right. We find out how and why it's a good idea.
There are many situations where we prefer Amazon S3 as the destination for our data lakes, but increasingly we are also using GitHub as a data lake destination. While GitHub repositories have some constraints compared to Amazon S3, they also have significant advantages for specific types of big data projects: a repository can be checked out, forked, and version controlled, which helps us stream the data we need across different applications.
Amazon S3 provides us with an industrial-grade solution for streaming data into our data lakes. We are developing a growing number of AWS Lambda serverless apps that stream data from common API sources into an Amazon S3 data lake. However, we are also developing a line of similar functions that stream the same data into GitHub repositories, providing a more flexible alternative for building real-time data lakes. This approach to data storage allows developers to craft more precise, potentially distributed, and collaborative data lakes.
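As a rough sketch of what the GitHub side of such a function might look like, the Python below builds a request for GitHub's REST "create or update file contents" endpoint (PUT /repos/{owner}/{repo}/contents/{path}), which commits a Base64-encoded file to a repository. The function names, the example-org/twitter-stream repository, and the one-file-per-minute path layout are assumptions made for illustration; they are not taken from any production function.

```python
import base64
import json
import urllib.request
from datetime import datetime, timezone

GITHUB_API = "https://api.github.com"


def batch_path(source, now=None):
    """Hypothetical layout: one JSON file per source per minute."""
    now = now or datetime.now(timezone.utc)
    return f"{source}/{now.strftime('%Y/%m/%d/%H%M')}.json"


def build_commit_request(owner, repo, path, records, token, sha=None):
    """Build (but do not send) a request that commits `records` as a JSON file."""
    body = json.dumps(records, indent=2, sort_keys=True).encode("utf-8")
    payload = {
        "message": f"Add {len(records)} records to {path}",
        # The contents endpoint expects the file body Base64-encoded.
        "content": base64.b64encode(body).decode("ascii"),
    }
    if sha is not None:
        # The API requires the current blob SHA when overwriting a file.
        payload["sha"] = sha
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{owner}/{repo}/contents/{path}",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Inside a Lambda handler, the request returned by `build_commit_request("example-org", "twitter-stream", batch_path("twitter"), records, token)` could be sent with `urllib.request.urlopen(...)`. Each batch then lands as its own commit, so the repository history doubles as a log of what arrived and when.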
Using either public or private repositories, we regularly find ourselves turning on streams of data that flow into specific repositories within coordinated GitHub organizations. We take real-time data from Twitter, Reddit, Stack Overflow, and other APIs and stream it into those repositories, where developers can check it out in real time, on specific schedules, or in response to events, and use it to train machine learning models or feed any other application. While Amazon S3 and GitHub both have APIs, GitHub adds the Git layer, with the benefits of version control and the other network effects of using GitHub.
GitHub isn't just for code. You can just as easily manage JSON, CSV, and YAML data within repositories, turning repos into forkable data lakes. In this continuously deployable and integrable development landscape, data lakes aren't always about big data; they can also be precise data stores that can be checked out and used wherever they are needed. This makes GitHub, GitLab, Bitbucket, and other source control systems an optimal solution for helping us manage data, and a destination for our streaming data lakes.
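On the consuming side, a checked-out repository is just a directory of files, so reading the data needs nothing beyond the standard library. The sketch below walks a local clone and yields every record it finds in JSON and CSV files. The `load_records` name and the rule that a top-level JSON list counts as many records are assumptions for illustration. YAML is left out because Python's standard library has no YAML parser; a third-party package such as PyYAML would be needed for it.

```python
import csv
import json
from pathlib import Path


def load_records(repo_dir):
    """Yield (relative_path, record) for each JSON and CSV record in a local clone."""
    root = Path(repo_dir)
    for path in sorted(root.rglob("*")):
        parts = path.relative_to(root).parts
        # Skip Git's own metadata and anything that is not a regular file.
        if ".git" in parts or not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if path.suffix == ".json":
            data = json.loads(path.read_text(encoding="utf-8"))
            # Treat a top-level list as many records, anything else as one.
            for record in data if isinstance(data, list) else [data]:
                yield rel, record
        elif path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as handle:
                # Each CSV row becomes a dict keyed by the header line.
                for row in csv.DictReader(handle):
                    yield rel, row
```

After a `git clone` or `git pull`, something like `for path, record in load_records("twitter-stream"): ...` would feed the records into a training job or any other application, and checking out an earlier commit would replay the data as it stood at that point.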
Published at DZone with permission of Kin Lane, DZone MVB. See the original article here.