Automating Data Science in a Big Data Environment

What steps in the Big Data analytics process can be automated to save time and money?

by Linda Gimmeson · Jan. 28, 17 · Opinion

Everything seems to be automated these days, from driverless cars to online BLS renewal, but one of the most transformative ways automation can affect us is through the automation of data science on large data sets.

Data science is growing increasingly important, and many organizations are trying to streamline the process with automation. The growth of technology has been both a curse and a blessing: with big data and the Internet of Things, data sets and conditions are constantly changing, so analysts must regularly maintain and re-create their models. This work is tedious and time-consuming, and much of it can be automated. An automated system can, in principle, handle many kinds of input data and generate a wide range of candidate solutions to a problem, saving valuable time and energy for human workers.

However, automating data science in a big data environment can be a complex challenge, especially because some areas still tend to require human interaction from a data scientist or software developer. Experts recommend thinking of data science automation as a two-level process in which (1) separate data science components are automated and then (2) the individual automated pieces are brought together to form a cohesive system.
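
The two-level idea can be sketched in a few lines of Python. This is only an illustration, not a method from any particular platform: the stage names (drop_missing, to_float, summarize) and the compose helper are hypothetical, and each stage stands in for a component that would be automated on its own before being chained.

```python
from functools import reduce

def compose(stages):
    """Level 2: chain independently automated stages into one callable."""
    stages = tuple(stages)  # freeze the order so the pipeline is reusable

    def run(data):
        return reduce(lambda acc, stage: stage(acc), stages, data)

    return run

# Level 1: each (hypothetical) component is automated on its own.
def drop_missing(rows):
    return [r for r in rows if r is not None]

def to_float(rows):
    return [float(r) for r in rows]

def summarize(rows):
    return [sum(rows) / len(rows)]

pipeline = compose([drop_missing, to_float, summarize])
result = pipeline(["1", None, "3"])
print(result)
```

Keeping every stage as a plain function with the same list-in, list-out shape is what makes the second level (integration) cheap in this sketch.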

There are four main areas that can be automated individually to create a fully automated system: data preparation, machine learning, domain knowledge, and result interpretation. These tasks can create automated models in three main areas:

Data Preparation

The first step of data science is the repetitive work of extracting, cleaning, and transforming data. Tasks can include imputing null values and transforming data for each specific algorithm. Many organizations that have automated this process use rule-based logic for these tasks, which might not be the best fit given that one purpose of data science is to replace rule-based systems. A better approach is data preprocessing driven by machine learning, which gives machines more power to decide what function to apply to a data set.
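
As a rough illustration of letting the data decide which function to apply, the sketch below chooses between two fill strategies for missing values by checking which one best predicts values that are already known. The candidate set, the function names, and the leave-one-out scoring are all assumptions made for this example; a production system would consider far more options.

```python
from statistics import mean, median

# Hypothetical candidate fill functions; a real system would consider many more.
CANDIDATES = {"mean": mean, "median": median}

def pick_strategy(column):
    """Name the candidate whose leave-one-out guesses best match the known values."""
    known = [v for v in column if v is not None]

    def loo_error(fill):
        # Hide each known value in turn and see how well the rest predict it.
        return sum(abs(fill(known[:i] + known[i + 1:]) - v)
                   for i, v in enumerate(known))

    return min(CANDIDATES, key=lambda name: loo_error(CANDIDATES[name]))

def impute(column):
    """Fill None entries using the automatically chosen strategy."""
    name = pick_strategy(column)
    known = [v for v in column if v is not None]
    fill = CANDIDATES[name](known)
    return [fill if v is None else v for v in column], name

# Needs at least two known values so each leave-one-out subset is non-empty.
cleaned, chosen = impute([1.0, 2.0, None, 3.0, 100.0])
print(chosen, cleaned)
```

With the outlier 100.0 in the column, the median predicts the hidden values more closely than the mean, so it is the strategy selected here. No fixed rule says "use the median"; the choice falls out of the data.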

Data preparation can also be automated through feature engineering, which converts raw data into predictors that increase the accuracy of a machine learning system. Automated feature engineering is still in the early stages of algorithm development. As the process matures, it could play a large role in the future of data science.
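
To make "converting raw data into predictors" concrete, here is a small hand-written example of the kind of transformation an automated feature engineering tool would search for. The record layout (timestamp, amount, items) and the derived feature names are invented for illustration.

```python
from datetime import datetime

def engineer_features(record):
    """Turn one raw record into model-ready predictors."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),          # Monday is 0, Sunday is 6
        "is_weekend": ts.weekday() >= 5,
        "amount_per_item": record["amount"] / record["items"],
    }

# Hypothetical raw record, as it might arrive from a transaction log.
raw = {"timestamp": "2017-01-28T14:30:00", "amount": 60.0, "items": 3}
features = engineer_features(raw)
print(features)
```

A raw timestamp string is of little use to most algorithms, while the hour, the weekday, and a weekend flag can each carry predictive signal.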

Machine Learning

In the manual world, a statistician looks at the data to determine the best algorithm to use and then puts the information into a model. In the automated world, machines choose the best algorithm for the data and hide the mathematical complexity so that the model and its results are easy to understand. This process involves more advanced automation because a machine must recognize input patterns and self-optimize to set boundaries for the equations. More advanced automated systems use cloud-based servers and meta-learning to automatically process and learn from huge amounts of data.
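
One common way for a machine to "choose the best algorithm" is to score every candidate with cross-validation and keep the winner. The sketch below assumes just two toy candidates (a constant-mean predictor and a least-squares line) and a simple three-fold split; the names and the data are made up for the example, and real systems compare many more models.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

CANDIDATES = {"mean": fit_mean, "line": fit_line}

def cv_error(fit, xs, ys, folds=3):
    """Mean squared error over held-out points, using interleaved folds."""
    total = 0.0
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in test)
    return total / len(xs)

def select_model(xs, ys):
    """Return the name of the candidate with the lowest cross-validated error."""
    return min(CANDIDATES, key=lambda name: cv_error(CANDIDATES[name], xs, ys))

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 11]   # exactly y = 2x + 1
best = select_model(xs, ys)
print(best)
```

Because the selection is driven by held-out error rather than by a person inspecting the data, the same loop can be rerun unchanged whenever the data set changes.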

Insights Generation

The end result of data science is not a new set of data; it is the interpretation of that data in a way that can be applied to an organization. A programmer or statistician may understand an output of data and how it relates to the business, but the process isn’t complete until the data can be understood by someone with no statistical knowledge. That means turning the data into a comprehensible and transparent story.

Automating this step is slightly more involved because it requires automatically creating user-friendly text from the raw numeric results. The leading approach for this type of automation is Natural Language Generation (NLG), which translates machine output into natural, human language. NLG tools include nlgserv and SimpleNLG; Markov chains can also be used to automatically generate sentences and create stories.
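
The Markov chain idea is simple enough to show directly. The following is a minimal word-level sketch, not how nlgserv or SimpleNLG work internally: it records which word follows which in a sample text and then walks those transitions at random. The sample corpus and the function names are invented for the example.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=12, seed=None):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    word, out = start, [start]
    # Stop at the length limit or when the current word has no known successor.
    while len(out) < max_words and chain.get(word):
        word = rng.choice(chain[word])
        out.append(word)
    return " ".join(out)

# Hypothetical training text; real systems learn from much larger corpora.
corpus = ("sales rose in the north and sales fell in the south "
          "while costs rose in the south and costs fell in the north")
chain = build_chain(corpus)
sentence = generate(chain, "sales", seed=0)
print(sentence)
```

Every adjacent word pair in the output was seen in the training text, which is why the result reads as locally plausible even though the chain has no understanding of the numbers behind it. For reports that must be factually tied to the results, template or grammar-based NLG is usually the safer choice.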

The automation of data science is in its early stages and will continue to evolve as further technologies are developed and applied. After creating individual modules, the next step is to create more generic platforms that can automatically integrate all aspects of a data science system. The process could be lengthy, but the results could be powerful across the business world.

Opinions expressed by DZone contributors are their own.