
5 Interesting Big Data Projects

Big data has the potential to transform the way we approach many problems. Read on to see how it's being applied to several real-world issues.

Big data analytics has driven much of the progress in machine learning over the last five years, and plenty remains unexplored. Implementing most big data projects requires knowledge of frameworks such as the Hadoop ecosystem, whose MapReduce framework enables massive scalability for distributed computing. Big data projects also require immense processing power, and there are two common ways to get it:

  • Setting up a Server or Distributed Cluster for Parallel Computing: One option is a single machine with many cores and a large amount of memory; multiple cores let many threads run in parallel, and big data workloads generally have high memory requirements. Another option is to set up several smaller machines and distribute the workload among them, which scales better because more machines can be added as needed. Either way, such setups are expensive and require a lot of configuration and maintenance.
  • Cloud Computing: Companies such as Amazon, Google, and Microsoft have built huge data centers and sell computing as a service, billed according to consumers' needs and usage period. This approach requires negligible setup and costs far less than the former.
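
To make the MapReduce model mentioned above concrete, here is a minimal single-machine sketch of the map-shuffle-reduce pattern applied to word counting. This is illustrative only: Hadoop's actual MapReduce API is Java-based and runs the same three phases across a distributed cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "data drives machine learning"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real cluster, the map and reduce phases run on many nodes at once and the shuffle moves data over the network, but the logical structure is the same.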

5 Interesting Big Data Use Cases

  • Crime Prediction: Machine learning has shown great promise for predicting crime. Historical data on crime locations, subjects, victim descriptions, times, and more can be used to train machine learning models. However, the number of data points is huge: for a metropolitan area, a single day's data would be enough to overload the average computer. Efficient, optimized models are therefore required for fast processing.
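
As a toy illustration of the idea (entirely made-up data and hypothetical grid-cell names, not a real crime model), a frequency-based "hotspot" baseline can be sketched in a few lines: given historical (location, hour) records, it predicts the cell with the most past incidents near a given hour.

```python
from collections import Counter

# Historical records: (grid_cell, hour_of_day) for past incidents (toy data).
incidents = [
    ("cell_12", 22), ("cell_12", 23), ("cell_12", 22),
    ("cell_07", 14), ("cell_07", 15), ("cell_12", 21),
]

def likely_hotspot(records, hour, window=1):
    """Return the grid cell with the most incidents within `window` hours."""
    nearby = Counter(
        cell for cell, h in records if abs(h - hour) <= window
    )
    return nearby.most_common(1)[0][0] if nearby else None

print(likely_hotspot(incidents, 22))  # cell_12
print(likely_hotspot(incidents, 14))  # cell_07
```

Real systems replace this counting step with trained models over far richer features, but even this baseline shows why data volume matters: the `incidents` list for a metropolitan area would hold millions of rows per day.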

  • Analyzing Nuclear Physics Data: This sounds cool to a lot of people, but it is just as complex. Institutions like CERN release much of their data to the general public for analysis and research, and this data is definitely not small. A single second of capture can contain more than one billion data points with up to ten different dimensions, and in some cases datasets have reached upwards of a trillion data points (yes, 12 zeroes). Processing at that scale demands serious computing power along with highly scalable frameworks; researchers at various universities use clusters with 100,000+ nodes to achieve timely computations.
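
Data at this scale can never be loaded into memory at once; it has to be processed as a stream. A minimal sketch of streaming aggregation, using a synthetic generator as a stand-in for a detector feed, shows the O(1)-memory pattern:

```python
def event_stream(n):
    """Stand-in for a detector feed: yields one synthetic reading at a time."""
    for i in range(n):
        yield (i % 100) / 10.0  # fake "energy" reading

def running_mean(stream):
    """Aggregate a stream in constant memory - no list of a billion points."""
    total, count = 0.0, 0
    for reading in stream:
        total += reading
        count += 1
    return total / count if count else 0.0

print(running_mean(event_stream(1_000_000)))  # close to 4.95
```

Frameworks like MapReduce generalize exactly this idea: many such partial aggregations run in parallel on different nodes and are then combined.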

  • Simulating and Predicting Traffic: Simulating and predicting traffic along a route is a long-standing problem. Models that correctly simulate traffic from real-time data exist; the harder step is building models that accurately predict traffic, which no one in the field has managed perfectly. It requires complex modeling and the handling of huge amounts of data with minimal latency, so the infrastructure behind such a model must be extremely efficient to deliver predictions in real time. Some frameworks to start with are OpenTraffic and SUMO.
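
To show what the simplest possible prediction baseline looks like (a moving average over recent speed samples, with invented numbers; real predictors are far more sophisticated), consider:

```python
from collections import deque

class TrafficPredictor:
    """Toy baseline: predict the next reading as the moving average
    of the last `window` speed samples for a road segment."""
    def __init__(self, window=3):
        self.samples = deque(maxlen=window)

    def observe(self, speed_kmh):
        self.samples.append(speed_kmh)

    def predict(self):
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)

p = TrafficPredictor(window=3)
for speed in [60, 50, 40, 30]:  # rush hour building up
    p.observe(speed)
print(p.predict())  # 40.0 (mean of the last three samples: 50, 40, 30)
```

The real challenge is doing this per road segment, for millions of segments, on data arriving every few seconds; that ingestion-and-update loop is where the big data infrastructure comes in.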

  • Modeling Natural Language: The languages computers use are simple and, in most cases, 'context-free.' Human languages are far more complex, requiring context, a huge knowledge base, and proper grammar. As more artificial intelligence assistants appear (such as Siri and Alexa), we also start to notice their imperfections and how far behind humans they remain. Modeling natural language is a tough task and requires huge amounts of data just to get started.
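
The simplest statistical language model, a bigram model, illustrates both the approach and its data hunger: it just counts which word follows which, so its quality depends entirely on how much text it has seen. A minimal sketch on a three-sentence toy corpus:

```python
from collections import defaultdict, Counter

def train_bigrams(corpus):
    """Count which word follows which - a bigram language model."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, word):
    """Return the most frequent continuation seen in training."""
    followers = model[word.lower()]
    return followers.most_common(1)[0][0] if followers else None

corpus = ["big data is big", "big data drives learning", "data is everywhere"]
model = train_bigrams(corpus)
print(predict_next(model, "data"))  # is
```

Modern assistants use vastly larger models trained on web-scale corpora, but the core lesson is the same: more (and more diverse) text yields better predictions, which is why this is a big data problem.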

  • Fraud Detection: Fraud detection applies to emails, text messages, transactions, and even the spoken word. Recognizing at scale that an email is fake or a transaction is shady requires more than human inspection. This application has potential uses across many domains and is an extremely important part of any service.
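
One classic technique here is a naive Bayes text classifier. The sketch below uses tiny, made-up word counts for "spam" and legitimate messages and scores a message by a smoothed log-likelihood ratio; production systems train on millions of labeled examples.

```python
import math
from collections import Counter

# Toy training data: word counts from labeled messages (hypothetical).
spam_words = Counter("win free money free prize win".split())
ham_words  = Counter("meeting tomorrow project report meeting".split())

def spam_score(message, smoothing=1.0):
    """Log-likelihood ratio under a naive Bayes model:
    positive -> looks like spam, negative -> looks legitimate."""
    spam_total = sum(spam_words.values())
    ham_total = sum(ham_words.values())
    vocab = len(set(spam_words) | set(ham_words))
    score = 0.0
    for word in message.lower().split():
        p_spam = (spam_words[word] + smoothing) / (spam_total + smoothing * vocab)
        p_ham = (ham_words[word] + smoothing) / (ham_total + smoothing * vocab)
        score += math.log(p_spam / p_ham)
    return score

print(spam_score("win free prize") > 0)   # True
print(spam_score("project meeting") > 0)  # False
```

The smoothing term keeps unseen words from producing zero probabilities; tuning it, and scaling the counting step across huge message volumes, is where the big data machinery earns its keep.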
