Curator's Note: The content of this article was originally written by Brad Stevenson over at the Silver Lining blog.
HDInsight aims to provide the best of both worlds in how it manages its data.
Azure Storage Vault (ASV) and the Hadoop Distributed File System (HDFS)
implemented by HDInsight on Azure are distinct file systems that are optimized,
respectively, for the storage of data and computations on that data. ASV provides
a highly scalable, highly available, low-cost, long-term, and shareable storage
option for data that is to be processed using HDInsight. The Hadoop clusters deployed
by HDInsight on HDFS are optimized for running Map/Reduce (M/R) computational
tasks on the data.
HDInsight clusters are deployed in Azure on compute nodes to execute M/R
tasks and are dropped once these tasks have been completed. Keeping the data in
the HDFS clusters after computations have been completed would be an expensive
way to store this data. ASV provides a full-featured HDFS file system over
Azure Blob storage (ABS). ABS is a robust, general purpose Azure storage
solution, so storing data in ABS enables the clusters used for computation to
be safely deleted without losing user data. ASV is not only low cost; it has also been
designed as an HDFS extension to provide a seamless experience to customers by
enabling the full set of components in the Hadoop ecosystem to operate directly
on the data it manages.
In the upcoming release of HDInsight on Azure, ASV will be
the default file system. In the current developer preview on www.hadooponazure.com, data stored in
ASV can be accessed by prefixing the URI of the assets you are accessing with
the ASV protocol scheme.
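As a rough illustration of what swapping the protocol scheme of a URI looks like, here is a minimal Python sketch. It is not HDInsight code: the helper name replace_scheme, the example paths, and the scheme string "asv" are all assumptions made for illustration, and the exact layout of the part after the scheme is left untouched because the text does not specify it.

```python
# Hypothetical helper: swap the protocol scheme of a URI and keep the rest as-is.
# The scheme name "asv" is an assumption for illustration only.
ASSUMED_ASV_SCHEME = "asv"


def replace_scheme(uri: str, new_scheme: str) -> str:
    """Return `uri` with everything before '://' replaced by `new_scheme`."""
    if "://" not in uri:
        raise ValueError(f"no protocol scheme found in {uri!r}")
    # Split only on the first '://' so the remainder is preserved verbatim.
    _, remainder = uri.split("://", 1)
    return f"{new_scheme}://{remainder}"


# Example: point a job input at ASV-backed storage instead of cluster-local HDFS.
# Both paths are made-up placeholders.
hdfs_uri = "hdfs://example/input/data.txt"
asv_uri = replace_scheme(hdfs_uri, ASSUMED_ASV_SCHEME)
print(asv_uri)
```

The point of the sketch is only that the asset path itself stays the same; the scheme at the front of the URI is what selects which file system serves the data.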
To use this feature in the current release, you will need
HDInsight and Windows Azure Blob Storage accounts. To access your storage
account from HDInsight, go to the Cluster and click on the Manage Cluster tile.