Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Importing and Exporting Google Spreadsheets With Treasure Data

DZone's Guide to

Importing and Exporting Google Spreadsheets With Treasure Data

Are Spreadsheets the business people’s data warehouse? Let’s admit it, spreadsheets are not going away.

· Big Data Zone
Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

Despite various business intelligence vendors’ campaigns against Excel and similar programs, spreadsheets continue to be the most popular business intelligence tool. Spreadsheets are more than a BI tool because they are so accessible for business people, compared to actual databases. Spreadsheets end up becoming the source of truth for various business metrics.

Google Spreadsheets are especially convenient in this regard because collaboration and revision tracking are built in. Excel pros may bemoan the lack of keyboard shortcuts, but sharing a Google Spreadsheet is far more manageable than passing around an Excel file over email. We might go as far as saying that Google Spreadsheets are the cloud data warehouse for businesspeople.

Here at Treasure Data, we are hardly alone in this belief. Many of Treasure Data’s early customers specifically asked for the feature to output query results into Google Spreadsheet because they wanted to share query results with businesspeople who aren’t familiar with SQL.

But what about the other direction? Can Treasure Data import data from Google Spreadsheets?

Enriching Big Data With (Small) Spreadsheet Data

Importing data from Google Spreadsheets into Treasure is not as silly as you might think. Here are some real use cases from our customer prospects:

1. Google Analytics/Adwords/AdSense integration: Because Google Spreadsheet is deeply integrated with the rest of Google’s ecosystem, it ends up becoming the data hub for Google-generated data, especially Adwords and Analytics.
2. Taking snapshots of business metrics: The marketing team uses Google Spreadsheets as their business data warehouse, and the data engineering team is asked to store daily snapshots of these spreadsheets and prepare them for trend analysis.

Once Google Spreadsheet data is available inside Treasure Data, it can be joined and correlated with higher-volume product usage data coming from web/mobile/server and Internet-enabled devices.

Google AppScript and Treasure Data API

How do we go about doing this? Google Spreadsheet supports scripting via AppScript, a JavaScript implementation with additional APIs specific to Google Apps. Because Treasure Data has a simple API for importing data, we can write a simple AppScript script and schedule them to run inside a Google Spreadsheet.

The script looks as follows:

CONSTANTS = {
  maxColumns: 1000,
  rowsPerRead: 200,
  maxRows: 100000,
  tdAPIKey: 'YOUR_API_KEY',
  databaseName: undefined,
  tableName: undefined,
  timeColName: "time",
  timeLowerBound: Date.parse(new Date())/1000 - 7*86400,
  timeUpperBound: Date.parse(new Date())/1000 + 3*86400
}

function toTDEntityName(name) {
  return name.toLowerCase().replace(/[^a-z0-9_]/g, '_').replace(/_+/, '_');
}

function postTreasureData(events, database, table, apikey) {
  var data = {};
  data[database+"."+table] = events;
  var payload = JSON.stringify(data);
  Logger.log(payload);
  var options = {
    "method": "POST",
    "contentType" : "application/json",
    "headers" : {
      "X-TD-Write-Key": apikey,
      "X-TD-Data-Type": "k"
    },
    "payload": payload
  };
  var response = UrlFetchApp.fetch("http://in.treasuredata.com/js/v3/event/", options);
  Logger.log(response);
}

function readColumnNames(activeSheet) {
  // Parse column names
  // Assume that the table begins on A1, with the first row as the header
  var colNames = [];
  var colIndex = 1;
  while (colIndex < CONSTANTS.maxColumns) {
    var colName = activeSheet.getRange(1, colIndex).getValue();
    if (colName === "") { break; }
    colNames.push(colName);
    colIndex += 1;
  }
  return colNames;
}

function validateTimeColumn(o) {
  var t = o[CONSTANTS.timeColName];
  if (typeof t === "number") {
    // check that the time is within reason
    if (t < CONSTANTS.timeLowerBound || t > CONSTANTS.timeUpperBound) {
      throw new RangeError("time field = " + t.toString() + " is out of range");
    }
    return o; // otherwise good.
  }
  if (typeof t === "string" || t instanceof Date) {
    var parsed_t = Date.parse(t)/1000;
    if (parsed_t === NaN) {
      throw new Error("time field = " + t.toString() + " is malformed");
    }
    // check that the time is within reason
    if (parsed_t < CONSTANTS.timeLowerBound || parsed_t > CONSTANTS.timeUpperBound) {
      throw new RangeError("time field = " + parsed_t.toString() + " is out of range");
    }
    o[CONSTANTS.timeColName] = parsed_t;
    return o;
  }
  throw new Error("Bad type for " + t.toString());
}

function readMultipleRows(activeSheet, rowIndex, colNames, rowCount) {
  var rowJSONs = [];
  for (var offset = 0; offset < rowCount; offset++) {
    // If all values are empty in the row, then we assume that it's the end
    // activeSheet.getRange(rowIndex+offset, 1).getValue() === "") { break; }
    var rowValues = activeSheet.getRange(rowIndex+offset, 1, 1, colNames.length).getValues();
    var isEnd = true;
    for (var ii in rowValues) {
      if (rowValues[0][ii] !== "") {
        isEnd = false;
        break;
      }
    }
    if (isEnd) { break; }
    rowJSONs.push((function() {
      var o = {};
      for (var ii in colNames) {
        o[colNames[ii]] = rowValues[0][ii];
      }
      validateTimeColumn(o);
      return o;
    })());
  }
  return rowJSONs;
}

function GoogleSheet2TD() {
  var activeSpreadsheet = SpreadsheetApp.getActiveSpreadsheet();
  var databaseName = CONSTANTS.databaseName || toTDEntityName(activeSpreadsheet.getName());
  var activeSheet = SpreadsheetApp.getActiveSheet();
  var tableName = CONSTANTS.tableName || toTDEntityName(activeSheet.getName());
  var tdAPIKey = CONSTANTS.tdAPIKey;
  colNames = readColumnNames(activeSheet);

  var rowJSONs;
  var rowIndex = 2;
  while (rowIndex < CONSTANTS.maxRows) {
    rowJSONs = readMultipleRows(activeSheet, rowIndex, colNames, CONSTANTS.rowsPerRead);
    if (rowJSONs.length === 0) { break; }
    postTreasureData(rowJSONs, databaseName, tableName, tdAPIKey);
    rowIndex += rowJSONs.length;
  }
}
  1. To use the script, replace ‘YOUR_API_KEY’ with your Treasure Data API Key.GoogleSheet2TD is the function that you want to call.
  2. This script copies the data in the currently active sheet, assuming that the data is laid out as follows:
    1. Row 1 is the header, starting on A1.
    2. Subsequent rows are the body data, until it finds the row where the first column is empty.
  3. By default, the name of the spreadsheet is used as the database name (normalized to be a valid Treasure Data database name). To specify a different database name, update CONSTANTS.databaseName.
  4. By default, the name of the sheet is used as the table name (normalized to be a valid Treasure Data table name). To specify a different table name, update CONSTANTS.tableName.

Seeing is believing, so here’s a quick gif of what it looks like:

Image title

What’s Next?

Let us know how you use your data in Google Spreadsheet. The above script is obviously an early prototype, and we’d love to get your feedback to help us evolve!

Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:
data

Published at DZone with permission of Kiyoto Tamura, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}