Over a million developers have joined DZone.

Weighing BigQuery vs. Redshift

DZone's Guide to

Weighing BigQuery vs. Redshift

Kiyoto Tamura explains the differences between Redshift and BigQuery and what it might mean to have a serverless future.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

In broad strokes, both BigQuery and Redshift are cloud data warehousing services. Honestly, the similarities are greater than the differences, and if you are looking to graduate from MySQL/PostgreSQL/SQL Server for analytics or moving away from expensive perpetual license MPP databases, you can’t really go wrong with either.

Comparison Chart Big Query

All of this being said, there are three nuanced differences that the user should be aware of.

  1. Loading data
  2. Managing resources (and hence pricing)
  3. Ecosystem

Loading Data

Both services support multiple methods to upload data. They are comparable, but depending on your overall architecture, one might be a better fit than another.

  • Redshift
    • COPY from S3: By far the most common way to load data into Redshift is via COPY command. COPY lets you copy data from S3 into Redshift at scale. Alternatively, you can use 3rd party services or use an open source tool like Embulk. It’s worth mentioning that Treasure Data supports outputting query results into Redshift, allowing the user to join multiple data sources and pre-process them for Redshift.
    • Kinesis Firehose/Streams: You can stream data from Kinesis into Redshift as described here.
  • BigQuery
    • Bulk Upload: CSV/JSON-per-line/Avro file formats are supported. Google Cloud Storage and Datastore backups are natively supported. Do you have your data in Amazon S3? No problem, Embulk can help you do that. Again, it’s worth mentioning that Treasure Data can output query results into BigQuery.
    • Streaming Upload: BigQuery supports streaming upload. Fluentd, our open source data collector, seems to be a popular tool as described in this blog article by the Google Cloud Platform team. Note that streaming inserts are not free so pay attention to your ingest rate and cost.
    • Google Analytics Premium: If you pay 150,000 USD a year to sign up for Google Analytics Premium, you get access to your web/mobile data inside Google Analytics via BigQuery.

Managing Resources

Neither service is truly “set and forget” and requires a dedicated engineer to learn the service and maintain it. You can use various tools to automate many aspects of the operation, but someone will have to maintain automation scripts and workflows.

That said, here are things that I’ve heard first-hand from talking to users:

  • Redshift
    • You still need to allocate different instance types and create clusters on your own.
      • Pros: It lets you mix and match CPU/storage to meet your analytic needs. Also, it’s easy for the purchasing team to work with, especially if you buy reserved instances upfront.
      • Cons: You have to be aware of (virtualized) hardware limits and scale up/out based on that.
    • You are charged by the hour for each instance you spin up. You can reserve instances and/or pay upfront for roughly 30-70% discount.
  • BigQuery

  • The idea of hardware is completely abstracted away and not visible, even as virtualized units. You create tables, put data there and query them. In a sense, every user has the “same” configuration. The pros and cons are the exact opposite of Redshift’s: No need to worry about configuration, but you won’t have that flexibility.
  • Because there’s no notion of instance types, you are charged by storage, streaming inserts and queries: See their pricing page for further details. Again, this has both pros and cons.

  • Pros: No need to worry about scaling up/out instances
  • Cons: The on-demand usage based pricing is less predictable and harder for your purchasing team to work with. (They recently added Cost Controls feature to deal with this issue).
  • Ecosystem

    Ecosystem is where Redshift is clearly ahead of BigQuery. While BigQuery is an affordable, performant alternative to Redshift, they are considered to be more “up and coming” (See, for example, Gartner Magic Quadrant 2015). This perception may change in the new few years under Diane Greene’s leadership, but for now, AWS has a bigger and more mature ecosystem.

    And ecosystem matters. If you are an existing AWS customer, it’s much easier from an accounting and finance point of view to procure Redshift and add it to your AWS bill rather than open a new account with Google Cloud Platform for Big Query. Also, if you already have data in S3, then moving that data into Redshift is easier and faster than migrating it to BigQuery.

    The Serverless Future?

    The differences between Redshift and BigQuery points to a greater industry trend: Serverless computing. The term, well on its way to becoming an IT buzzword du jour, is the core philosophy behind BigQuery: “Do not worry about how your queries are run, just write them and tell it to run. Google will take care of it.” This is in sharp and clear contrast with Redshift which essentially loans you powerful software and hardware on-demand and lets you run with it.

    For now, it’s too early to tell whose approach is going to be the eventual winner. As a platform that speaks to both services, we will keep the reader updated as we learn more.

    Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

    big data ,database ,ecosystem ,bigquery ,serverless architecture ,redshift

    Opinions expressed by DZone contributors are their own.

    {{ parent.title || parent.header.title}}

    {{ parent.tldr }}

    {{ parent.urlSource.name }}