Key Considerations for a Cloud Data Warehouse
Key Considerations for a Cloud Data Warehouse
Look at some considerations to help you select the best cloud data warehouse, with tips like identifying the use case and understanding cloud data warehouse capabilities.
Join the DZone community and get the full member experience.Join For Free
Data growth and diversity have put new pressures on traditional data warehouses, resulting in a slew of new technology evaluations. The data warehouse landscape offers a variety of options, including popular cloud solutions that offer pay-as-you-go pricing in an easy-to-use and scale package. Here are some considerations to help you select the best cloud data warehouse.
First, Identify Your Use Case
A cloud data warehouse supports numerous use cases for a variety of business needs. Here are some common use cases along with the notable capabilities required for each.
Ad Hoc Analysis
Ad hoc analysis provides guided or open queries to the data warehouse, giving the end user flexibility to explore deeper questions. Users use native SQL or an interactive visual analysis tool such as Tableau or Looker. Each query result often prompts the user to dive further into the data, going from summary views or aggregate views into distinct row level detail. A data warehouse that is good at ad hoc analysis delivers fast, consistent response across a variety of query types.
How does a data warehouse support ad hoc analysis?
- Efficient query processing that can scan, join, and aggregate data in a variety of table structures.
- Columnstore table format for optimized disk usage and accelerated aggregate query response.
- Relational data format with ANSI SQL query syntax provides a familiar, easy to use structured language.
- Built-in statistical functions such as MAX, MIN, SUM, COUNT, STD, NTILE, and RANK, to name a few, will make it easier to build sophisticated queries.
- Data security ensures different users are shielded from sensitive or unauthorized data, requiring user authentication, role based access control, and row level security..
- Scalable concurrency for supporting thousands of users running a variety of queries simultaneously.
- Native connectivity to leading business intelligence tools for easier visual analysis and collaborative dashboards.
Machine Learning and Data Science
Data science and machine learning use a data warehouse to identify trends, discover hidden data relationships, and predict future events with sophisticated algorithms. Machine learning is a technique that can learn and improve insight discovery without explicitly being programmed to do so. Data scientists will often require large volumes of data to improve their predictions and correlations. Data is often enriched and cleaned or packaged into sample data sets for faster experimentation. Experiments are commonly performed offline due to the intense processing power required by the analysis. Advances in algorithms, hardware, and machine learning tooling have led to more advanced data processing called Artificial Intelligence that can automatically identify hard to find events with relatively little human coordination.
How does a data warehouse support machine learning and data science?
- Support a variety of data types including relational, CSV, JSON, and geospatial formats.
- Provide native interoperability with data preparation and statistical tooling such as Spark, SparkML, Python, R, SAS, and TensorFlow.
- To maximize resource savings, offer rapid sandbox configuration for quick experimentation with easy to spin up and down databases.
- To support collaboration and sharing of analyses, offer native connectivity with modern business intelligence tools such as Tableau, Zoomdata, and Looker.
Real-Time and Operational Analytics
Operational analytics often manage key performance indicators (KPIs) by querying data continuously. The insights might be used by several times a day by people or machines. The speed of response for an operational or real-time analytics solution can vary based on the systems in place and the organizational readiness. Gartner’s Roy Schulte said it best in his report, How to Move Analytics to Real Time:
“Business real time is about situation awareness; sensing and responding to what is happening in the world now, rather than to what happened a few hours or days ago, or what is predicted to happen based on historical data.”
How does a data warehouse support real-time analytics?
- Stream data ingestion that can be immediately queried for analysis.
- Fast processing of repeat queries, potentially by thousands of users or applications.
- To reduce outages and maintain 24/7 operational support, high availability that includes redundancy and auto-failover.
- To improve accuracy and decision speeds, exactly-once semantics for real-time data de-duplication and enrichment.
Mixed Workload Analytics
Most organizations want a single source of data to improve decision accuracy and support a variety of workloads across ad hoc, machine learning, and real-time analytics. Thees expanded use cases place a strong emphasis on performance, security, and user or application concurrency. Due to the variety of applications requiring sub-second data access, mixed workloads can be a challenge to tune and govern.
How does a data warehouse support mixed workload analytics?
- A robust, efficient, and distributed query processor that can support a broad range of queries without overpaying for extra hardware resources or require hard to manage database configurations.
- Rapid, easy-to-scale architecture that can address changes in workload complexity and user concurrency load.
- Comprehensive security to shield users from seeing sensitive data without requiring custom database schemas or views.
- Broad data ingestion to support real-time streaming and batch load requirements.
Next Up, Understanding Cloud Data Warehouse Capabilities
As you evaluate your next cloud data warehouse investment, it’s important to know the range of capabilities that are important for your project or business. Below is a list of capabilities organized by category to help you identify the right data warehouse.
- Rapid provisioning. Setup should be self-service and take a few minutes from the point of sign up to a running functioning database.
- Accessibility. For easy query processing and integration with existing applications, tools, and skills, the environment should support relational data using ANSI SQL.
- Easy data loading. A guided or integrated data loading process should give users an easy integrated way to deploy a real-time data pipeline or bulk load ingestion.
- Optimized query processing. The database should have a distributed query optimizer that can process most queries with minimal specialized tuning.
- Simplified capacity management. As data or user growth expands, the data warehouse should provide a managed or automated capacity adjustment to quickly address changing workloads.
- Ingest to analysis. Streaming data ingestion with simultaneous query processing ensures the fastest possible insights on live and historical data.
- Fast queries. Subsecond query response against billions of rows with vectorized query processing and columnstore structure for ad-hoc dashboards or operational reports.
- Operationally tuned. Compiled SQL queries accelerate query execution for added performance gains.
- On-demand pricing. Sometimes a data warehouse is not required for 24/7 operation; hourly billing can tightly associate the usage to payment.
- Annual discounts. Reserved pricing discounts should be an option for operational deployments that are always available.
- Multicloud. To maximize the proximity of your data and get the ultimate performance for your applications, you need the freedom to choose the cloud service provider you prefer or have standardized on.
- Hybrid cloud: Maintain existing investments by spanning data warehouse investments across on-premises and cloud on a single platform.
- Elastic: Driven by growth in data, users, or query sophistication, rapidly scale out or down for new capacity requirements.
- Interoperable: To ensure compatibility with existing tools, applications, and skills, support JDBC/ODBC connectivity, MySQL wire protocol, and ANSI SQL.
- Concurrency support. Scale-out distributed architecture ensures that high volume ingest and write queries do not degrade dashboard or report performance.
- High availability. Efficient replication and distributed architecture ensure no single point of failure for operational requirements.
- Durable. All data should reside on disk for audit or regulatory requirements along with expedited recovery from unexpected failures.
- Comprehensive. Data should be secured across the analysis lifecycle, from single sign-on (SSO) authentication, role-based access control (RBAC), SSL encryption of data over the wire, encryption of data at rest, granular audit logging, and separation of concerns for database administrators
- Consistent. Ensure a consistent security model across on-premises to the cloud with strong security capabilities across deployment.
Published at DZone with permission of Mike Boyarski . See the original article here.
Opinions expressed by DZone contributors are their own.