GitHub Is Bad for AI: Solving the ML Reproducibility Crisis
GitHub is not suitable for machine learning because ML isn’t just code, and code is what GitHub was made for.
There is a crisis in machine learning that is preventing the field from progressing as fast as it could. It stems from a broader predicament surrounding reproducibility that impacts scientific research in general. A Nature survey of 1,500 scientists revealed that 70% of researchers have tried and failed to reproduce another scientist’s experiments, and over 50% have failed to reproduce their own work. Reproducibility, also called replicability, is a core principle of the scientific method and helps ensure the results of a given study aren’t a one-off occurrence but instead represent a replicable observation.
In computer science, reproducibility has a narrower definition: Any results should be documented by making all data and code available so that the computations can be executed again with the same results. Unfortunately, artificial intelligence (AI) and machine learning (ML) are off to a rocky start when it comes to transparency and reproducibility. For example, a response published in Nature by 31 scientists was highly critical of a study from Google Health that documented successful trials of an AI system that detects signs of breast cancer.
The skeptical scientists claim the Google study offered far too little detail about how the AI model was built and tested and went so far as to say it was merely an advertisement for proprietary technology. Without adequate information about how a given model was created, it is nearly impossible for the scientific community to review and reproduce its results. This is contributing to a growing perception that transparency is lacking in artificial intelligence, exacerbating trust issues between humans and AI systems.
To maintain forward momentum and succeed with artificial intelligence, it will be essential to address replicability and transparency issues in the field. This article explains the significance of the reproducibility crisis on AI, as well as how a new version of GitHub built specifically for machine learning could help solve it.
Why We Need a GitHub Built Specifically for Machine Learning
GitHub is a cloud-based service for developing and managing code. The platform is used for software version control, which helps developers track changes to code throughout the development lifecycle. This makes it possible to safely branch and merge projects and ensure code is reproducible, working the same way regardless of who is running it. Because AI and ML applications are written in code, GitHub was the natural choice for managing them. Unfortunately, a number of differences between AI and more traditional software projects make GitHub a bad fit for artificial intelligence, contributing to the reproducibility crisis in machine learning.
GitHub Wasn’t Designed With Data as a Core Project Component
Traditional software algorithms are created by developers taking ideas out of their heads and writing them as code in a deterministic, mathematical, Turing-complete language. This makes software highly replicable—all that is needed to reproduce a given piece of software is its code and the libraries used for task optimization.
Machine learning algorithms are different because they aren’t created from the minds of developers but instead inferred from data. This means that if the data changes, the machine learning algorithm changes, even if the code and operating environment variables recorded in traditional software development remain constant. This is the heart of the problem with using GitHub for AI: Even if you track the code and libraries used to develop an artificial intelligence algorithm, you can’t reproduce it because it depends on the data, not just the code. Some ways to overcome this include:
- Automated data versioning: In order to avoid replicability issues that stem from inconsistent training datasets, data versioning must be a key feature of any platform designed to manage AI/ML projects. This gives teams an automated way to track all changes made to data, ensuring that results can be tied to the specific version of the training dataset that informs them. Although GitHub today can track changes to code, it can’t track data. Overcoming this will play a key role in solving the reproducibility crisis in AI.
- Immutable data lineage: An immutable data lineage provides an unchangeable record for all activities and assets in the machine learning lifecycle associated with data. This enables ML teams to track every version of their code, models, and data. By providing an immutable record for all activities associated with an ML model from training to production, reproducibility is safeguarded and the relationships between historical datasets are better managed.
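The two ideas above can be illustrated with a minimal Python sketch. It is not how any particular platform implements them: the names `dataset_version` and `LineageLog` are hypothetical, the dataset is assumed to be small enough to serialize as JSON, and the log is only tamper-evident (editing an old entry is detectable) rather than truly immutable.

```python
import hashlib
import json


def dataset_version(records):
    """Return a content hash that changes whenever the data changes."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def _hash_body(body):
    """Hash a log entry body in a stable, key-order-independent way."""
    return hashlib.sha256(
        json.dumps(body, sort_keys=True).encode("utf-8")
    ).hexdigest()


class LineageLog:
    """Append-only log in which each entry commits to the previous one."""

    def __init__(self):
        self._entries = []

    def append(self, code_version, data_version, note):
        """Record which code and data versions produced a result."""
        prev_hash = self._entries[-1]["entry_hash"] if self._entries else ""
        body = {
            "code_version": code_version,
            "data_version": data_version,
            "note": note,
            "prev_hash": prev_hash,
        }
        entry = {**body, "entry_hash": _hash_body(body)}
        self._entries.append(entry)
        return entry["entry_hash"]

    def verify(self):
        """Recompute every hash; a modified entry breaks the chain."""
        prev_hash = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if body["prev_hash"] != prev_hash:
                return False
            if _hash_body(body) != entry["entry_hash"]:
                return False
            prev_hash = entry["entry_hash"]
        return True
```

Tying each training run to the value returned by `dataset_version` means a result can always be traced back to the exact data that produced it, and `verify` shows whether the recorded history has been altered since.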
Artificial Intelligence Uses Massive, Unstructured Datasets
It isn’t just the inability to track changes in data that makes using GitHub for AI problematic. Traditional software and AI also depend on completely different data types. Software is written in code, and code is expressed as text. By nature, text files are not very large. Conversely, artificial intelligence relies on unstructured data, such as audio, images, and video, which are far bigger than text files and therefore present additional data tracking and management challenges.
The process by which data from multiple sources is combined into a single data store is called extract, transform, and load (ETL). This is a general process for replicating data from source systems to target systems, and it makes it possible for different types of data to work together. To extract, transform, and load data for use in AI application development, data scientists and engineers need data versioning, data lineage, the ability to handle large files, and a way to manage the scripts and libraries used for data processing.
Some emerging solutions to this problem are discussed later in the article, but it is important to note that this functionality is not currently built into the core of GitHub, which makes it impossible to properly manage the data that inform machine learning algorithms on the platform.
ML Model Parameters Introduce Additional Complexity
These issues with AI replicability and using GitHub for ML projects extend beyond just the inability to track changes in data and manage large, unstructured datasets. Even if the code, libraries, and data used to develop an artificial intelligence algorithm remain constant, it still wouldn’t be possible to replicate the same results using the same AI system because of variability in model parameters.
As mentioned before, machine learning algorithms are informed by data. However, this isn’t the only factor that influences the system. Parameters are other inputs that contribute to how a given algorithm functions. There are two types of model parameters: hyperparameters and plain parameters. Hyperparameters can be thought of as high-level controls for the learning process that influence the resulting parameters of a given model. After ML model training is complete, parameters are what represent the model itself. Hyperparameters, although used by the learning algorithm during training, are not part of the resulting model.
By definition, hyperparameters are external to an ML model and their values cannot be estimated from data. Changes to hyperparameters result in changes to the exact algorithm that the machine learning model ultimately learns. If the code is the design of how to build a human brain, the hyperparameters and models are how to build your exact brain. This is important because the same code base used to train a model can generate hundreds or thousands of different sets of parameters, each of which is a different model.
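A small Python sketch can make the distinction concrete. It assumes a toy one-weight linear model trained by gradient descent; the function name `train` and the sample data are illustrative and do not come from any library. The learning rate and the number of epochs are hyperparameters, while the returned weight `w` is the learned parameter.

```python
def train(data, learning_rate, epochs):
    """Fit y = w * x by gradient descent on mean squared error.

    learning_rate and epochs are hyperparameters: chosen before training.
    w is the parameter: learned from the data and returned as the model.
    """
    w = 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= learning_rate * grad
    return w


# Same code and same data for both runs (the data follows y = 2x).
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Only the hyperparameters differ, yet the learned parameter differs too.
model_a = train(DATA, learning_rate=0.01, epochs=5)
model_b = train(DATA, learning_rate=0.05, epochs=50)
```

Here `model_b` ends up very close to 2, while `model_a` stops well short of it. Tracking the code and data alone would not tell you which of the two models you have.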
Experiment Results Tracking and Code Review
When testing machine learning models, it is important to track experimental results. These results help determine which model is the best fit for production, and, unsurprisingly, GitHub wasn’t designed to record these details. Although it is possible to build a custom workaround, this solution doesn’t scale and is inaccessible to many developers due to time and resource constraints.
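To show what such a custom workaround might look like, here is a minimal sketch that appends one JSON record per experiment to a local file. The function names, the record fields, and the JSON Lines layout are all assumptions made for illustration. A single local file like this is also a good example of why the approach does not scale to teams or to large numbers of runs.

```python
import json
import time


def log_run(path, hyperparameters, data_version, metrics):
    """Append one experiment record as a line of JSON."""
    record = {
        "timestamp": time.time(),
        "hyperparameters": hyperparameters,
        "data_version": data_version,
        "metrics": metrics,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def load_runs(path):
    """Read back every logged run, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def best_run(runs, metric, higher_is_better=True):
    """Pick the run with the best value for the given metric."""
    chooser = max if higher_is_better else min
    return chooser(runs, key=lambda run: run["metrics"][metric])
```

Calling `log_run` after each training run and `best_run(load_runs(path), "accuracy")` afterwards gives a basic way to compare candidates for production.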
Managing a machine learning model also involves code review and version tracking, which is where GitHub excels. Although GitHub tracks code and environment variables very well, machine learning introduces the need to track data, parameters, metadata, experimental results, and much more. GitHub was not built to accommodate this level of sophistication but, fortunately, there are some emerging solutions that attempt to overcome the limitations of GitHub for AI and ML.
Alternatives to GitHub for AI and ML
There is no single alternative to GitHub that offers a comprehensive solution for managing AI and ML projects. Ideally, a GitHub specifically tailored for machine learning will become available to data scientists and engineers operating in this space. Until then, there are a number of solutions that address different issues mentioned above:
- Neptune is a metadata store for MLOps that offers a single place to log, store, display, organize, compare, and query all ML model-building metadata. Documentation is available for using Neptune for data versioning. This includes versioning datasets in model training runs, comparing datasets between runs, and organizing and sharing dataset versions.
- Pachyderm is a data layer used to enhance the machine learning lifecycle. The company offers solutions for automated data versioning and an immutable data lineage.
- DVC is an open-source version control system built for machine learning projects. The tool allows data scientists and engineers to save and reproduce experiment results, control versions of models and data, as well as establish processes for deployment and collaboration.
- Git Large File Storage (Git LFS) is an open-source Git extension for versioning large files. It replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server. It aims to help developers work more efficiently with large files and binary files.
- Dolt is an SQL database that can be forked, cloned, branched, merged, pushed, and pulled in the same way as a Git repository. It positions itself as “Git for data,” addressing the shortcomings of GitHub for data management outlined above. Dolt is commonly used for version tracking to ensure consistent model replicability, among a variety of other use cases.
- LakeFS is a data management tool that is available in both open-source and paid software-as-a-service (SaaS) versions. This solution emphasizes full reproducibility of data and code, rapid data reversion, and petabyte-scale version control.
- Delta Lake is an open-source project that enables building a Lakehouse Architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. Some features of this solution that make it a good option for machine learning include an open protocol for data sharing, scalable metadata handling, data versioning, and the ability to view an audit history of every change made to data.
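The pointer-file idea behind Git LFS, mentioned in the list above, can be sketched in a few lines of Python. This illustrates the concept only and is not the Git LFS client: the helper name `lfs_style_pointer` is hypothetical, and in practice pointers are created automatically once a file pattern is registered with `git lfs track`. The three-line layout follows the published Git LFS pointer format.

```python
import hashlib


def lfs_style_pointer(content: bytes) -> str:
    """Build a small text pointer that stands in for a large file.

    The repository stores only this text; the real bytes live elsewhere
    and are looked up by their SHA-256 digest.
    """
    digest = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest}\n"
        f"size {len(content)}\n"
    )
```

A file of any size becomes a pointer of roughly the same small length, which is why this approach keeps a repository lightweight while still identifying the exact file contents.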
Published at DZone with permission of Brad Cordova. See the original article here.