Does Datameer Support a Full Big Data Analysis Process?
over the last few days i had the chance to test the datameer analytics solution (das). das is a platform for hadoop that includes data source integration, an analytics engine and visualization functionality. this promise of a fully integrated big data analysis process motivated me to test the product.
it really does include all the functionality required for data management or etl, it provides standard tools to analyze data, and there are nice ways to build visualization dashboards. for example, there are connectors for twitter, imap, hdfs and ftp. all menus and processes are self-explanatory and the complete interface is strongly excel or spreadsheet oriented. if you are familiar with excel you can analyze your big data out of the box. to keep the interactive analysis fast, you work with only a subset of your data, and the analyses you store are automatically transformed into a kind of procedure. in the end, or according to a schedule you set, you “run” the analyses on your big data: das collects the latest data for you, creates mapreduce jobs in the background and updates all your spreadsheets and visualizations. to close the analysis cycle you can use the connectors to write your results back to hdfs or to a database such as hbase, among many other technologies.
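the idea of defining an analysis on a subset and replaying it later on the full data can be sketched in a few lines of python. this is only an illustration of the workflow, not how das implements it: the step functions, the field names (`origin`, `arr_delay`) and the tiny in-memory data set are all made up for the example.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def keep_delayed(rows: List[Row]) -> List[Row]:
    """Keep only flights that arrived late (hypothetical 'arr_delay' field)."""
    return [r for r in rows if r["arr_delay"] > 0]


def count_by_origin(rows: List[Row]) -> List[Row]:
    """Count the remaining flights per origin airport."""
    counts = Counter(r["origin"] for r in rows)
    return [{"origin": o, "flights": n} for o, n in sorted(counts.items())]


def run_pipeline(steps: List[Step], rows: List[Row]) -> List[Row]:
    """Apply the stored steps in order; the same list works for any input."""
    for step in steps:
        rows = step(rows)
    return rows


# made-up stand-in for the full data set
full_data = [
    {"origin": "SFO", "arr_delay": 12},
    {"origin": "SFO", "arr_delay": -3},
    {"origin": "JFK", "arr_delay": 45},
    {"origin": "JFK", "arr_delay": 7},
    {"origin": "ORD", "arr_delay": 0},
    {"origin": "SFO", "arr_delay": 30},
]

# the "procedure": an ordered list of steps, defined once
pipeline = [keep_delayed, count_by_origin]

# design the analysis interactively on a small subset ...
preview = run_pipeline(pipeline, full_data[:3])

# ... then replay the identical steps on everything
result = run_pipeline(pipeline, full_data)
print(preview)
print(result)
```

the point of the design is that the steps are stored independently of the data they were developed on, so the preview and the full run cannot drift apart.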
das is really designed for big data. if you test it with small data you will be frustrated by the performance, because the overhead of creating mapreduce jobs dominates in this situation. but as soon as you start with real big data analyses this overhead becomes negligible and das takes over a lot of your programming work.
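a simple way to picture this is a cost model with a fixed start-up cost per job plus a processing cost that grows with the data size. the sketch below is a toy model under that assumption; the numbers (30 seconds of overhead, 60 seconds per gb) are purely hypothetical and were not measured on das or on my cluster.

```python
def overhead_share(fixed_overhead_s: float, seconds_per_gb: float, data_gb: float) -> float:
    """Fraction of the total runtime spent on fixed job overhead.

    Toy model: total = fixed_overhead_s + seconds_per_gb * data_gb.
    """
    processing_s = seconds_per_gb * data_gb
    return fixed_overhead_s / (fixed_overhead_s + processing_s)


# hypothetical values, chosen only to show the trend
FIXED_OVERHEAD_S = 30.0
SECONDS_PER_GB = 60.0

for size_gb in (0.01, 1.0, 100.0):
    share = overhead_share(FIXED_OVERHEAD_S, SECONDS_PER_GB, size_gb)
    print(f"{size_gb:>7} gb: {share:.1%} of the runtime is overhead")
```

whatever the real constants are, the share of the fixed part shrinks as the data grows, which matches what i saw qualitatively.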
my test infrastructure
the following figure provides a nice overview of the datameer infrastructure. das supports many data sources, it runs on all hadoop distributions, it provides a rest api and you can add plugins that act as connectors to other modelling languages such as r (#rstats).
i tested das version 3.1.2 running on our mapr hadoop cluster version 3.0.2. after getting the latest package version from datameer support, the installation was straightforward and it worked out of the box. thanks to datameer for providing a full test license. there are several online tutorials and videos available and there are some tutorial apps. apps are another great feature of datameer. you can download datameer apps, which include connectors, workbooks and visualizations for different analysis examples, and you can create your own app from your analyses and share it with your colleagues or the community.
my test data and analyses
i tested das with the famous “airline on-time performance” data set consisting of flight arrival and departure details for all commercial flights within the usa, from october 1987 to april 2008. i downloaded all the data (including supplements) to maprfs, created connectors for the data and imported the data into a workbook.
in the workbook i tested many classical statistical counting analyses:
- grouping functionality for the airports and counting the number of flights
- grouping for the airlines and calculating different statistics such as mean values for the air time
- using joins to add additional information like the airline name to the airline identifier
- doing sorts to extract the most interesting airports depending on different measures
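for readers who think in code rather than in spreadsheets, the four kinds of analyses listed above can be sketched in plain python. this is not what das generates; the rows, the simplified field names (`origin`, `carrier`, `air_time`) and the carrier lookup table are made up for the illustration.

```python
from collections import defaultdict

# made-up sample rows with simplified field names
flights = [
    {"origin": "SFO", "carrier": "UA", "air_time": 300},
    {"origin": "SFO", "carrier": "AA", "air_time": 320},
    {"origin": "JFK", "carrier": "UA", "air_time": 100},
    {"origin": "ORD", "carrier": "AA", "air_time": 120},
    {"origin": "SFO", "carrier": "UA", "air_time": 200},
]

# made-up supplement table: carrier identifier -> airline name
carriers = {"UA": "United Air Lines", "AA": "American Airlines"}

# 1. group by airport and count the flights
flights_per_airport = defaultdict(int)
for f in flights:
    flights_per_airport[f["origin"]] += 1

# 2. group by airline and compute the mean air time
air_times = defaultdict(list)
for f in flights:
    air_times[f["carrier"]].append(f["air_time"])
mean_air_time = {c: sum(t) / len(t) for c, t in air_times.items()}

# 3. join: attach the airline name to each airline identifier
named = [
    {"carrier": c, "name": carriers.get(c, "unknown"), "mean_air_time": m}
    for c, m in mean_air_time.items()
]

# 4. sort: airports ordered by number of flights, busiest first
busiest = sorted(flights_per_airport.items(), key=lambda kv: kv[1], reverse=True)

print(dict(flights_per_airport))
print(named)
print(busiest)
```

in das each of these steps ends up as its own sheet, which plays the role of the intermediate variables here.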
i am not an excel expert, so it took me some time to get used to this low-level process of doing analyses in spreadsheets. but in the end it is a very intuitive way of creating analyses.
every new analysis is available in a new tab in your workbook. there are several nice features to support your work. for example, there is a “sheet dependencies” overview which shows how the sheets depend on each other.
apart from the classical analyses, das provides some data mining functionality. it is called “smart analytics”. so far, it covers k-means clustering, decision trees, column dependencies and recommendations. it works out of the box but is not yet at a level that is satisfactory for real analyses. for example, for k-means clustering there is no support for choosing the right number of clusters (k) and you cannot switch between different distance functions (the default is euclidean distance).
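to make the two missing pieces concrete, here is a minimal k-means sketch in plain python with a pluggable distance function and a loop over several values of k, so that the cost can be compared (the usual "elbow" inspection). it is my own illustration, not the das implementation, and the sample points are made up. one simplification to be aware of: the centroid is always updated as the mean, which is only guaranteed to reduce the cost for euclidean distance; with manhattan distance a median update (k-medians) would be the cleaner choice.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Distance = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a: Point, b: Point) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(
    points: List[Point],
    k: int,
    distance: Distance = euclidean,
    iterations: int = 20,
    seed: int = 0,
) -> Tuple[List[Point], float]:
    """Return (centroids, cost); cost is the sum of squared distances
    from each point to its nearest centroid."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(list(points), k)]
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # update step: mean of each cluster; an empty cluster keeps its centroid
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    cost = sum(min(distance(p, c) for c in centroids) ** 2 for p in points)
    return centroids, cost


# made-up points forming two obvious groups
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# compare the cost for several k; a sharp drop followed by a flat tail
# suggests the k at the bend
costs = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
for k, cost in costs.items():
    print(k, round(cost, 3))

# switching the distance function is just another argument
centroids_l1, cost_l1 = kmeans(points, 2, distance=manhattan)
```

a tool that exposed these two knobs, plus a plot of the cost against k, would already cover the gaps i ran into.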
finally, i visualized all my results in a nice “infographic”. there are many different visualization tools and parameters available. after playing around with the settings you can create a nice dashboard and share it with your colleagues.
please be aware that the complete data set is about 5 gb. importing the data set took about 30 minutes and running the workbook took more than 3 hours in my case. in the end i split my analyses into several workbooks to keep the individual runs manageable.
it was easy to get started with datameer analytics solution (das). it is definitely a great tool to do big data analyses without any detailed hadoop or big data knowledge. furthermore, it covers many use cases and provides all required functionality for your daily analysis process. however, as soon as your analyses get more complex, the limitations of datameer become apparent and you will probably look for a more powerful tool set or start implementing your big data analyses directly on hadoop.
finally, datameer supports many steps in the big data analysis process, it works efficiently and its usability is straightforward. but big data is more than etl, data analysis and visualizing the results. you should never forget to think about your use case and the business value that you want to extract from your data. in the end, this is what should guide you in choosing the tools and/or implementations to use.
Published at DZone with permission of Comsysto Gmbh, DZone MVB. See the original article here.