Getting Started With GenAI on BigQuery: A Step-by-Step Guide
This article shows how to use Google Cloud's BigQuery ML and Gemini to analyze and summarize text data directly within BigQuery using SQL and ML.GENERATE_TEXT function.
Join the DZone community and get the full member experience.
Join For FreeOverview
Companies are continuously looking for methods to extract important information from their continuously growing databases. Machine learning has become a game-changer in this mission, allowing us to predict user behavior, discover emerging trends, and automate complex decision-making.
Google Cloud's BigQuery ML has completely redefined the way we deal with ML by bringing model training directly to the data warehouse. This means we can come up with sophisticated ML models by writing simple SQL queries and deploying them, which removes the need to move complex data or integrate with other separate ML platforms.
What if you could elevate your data analysis even more? Picture being able to easily analyze and grasp the nuances of textual data, revealing a treasure of insights buried in customer feedback, social media comments, or product reviews. This is where Gemini, Google's advanced family of large language models, steps in. Gemini provides natural language processing (NLP) features, allowing us to carry out tasks such as sentiment analysis, topic extraction, and text classification with unmatched precision.
This comprehensive guide unveils the power of generative AI within the BigQuery environment, offering a practical, hands-on approach to harnessing its potential. A key highlight of this post is that it will walk you step by step on how you can use Gemini model to summarize Gemini product release notes, which will provide us with the most up-to-date insights into this groundbreaking technology. And, for this we will be using a BigQuery public dataset which stores information on product releases.
Setup Instructions
- Before starting, choose your GCP project, link a billing account, and enable the necessary API; full instructions here.
- Create a BigQuery dataset that will store our model, following the steps here.
- Create a cloud resource connection and get the connection's service account; the full guide is here.
- Grant access to the service account by following the steps here.
- Let’s take a look at the dataset that we would be using:
SELECT * FROM `bigquery-public-data.google_cloud_release_notes.release_notes`
LIMIT 10;
Product Release Notes Summary
Let’s walk through an example of summarizing the release notes for Gemini.
1. Create a Model
Create a remote model in BigQuery that utilizes a Vertex AI foundation model.
Syntax:
CREATE OR REPLACE MODEL
`PROJECT_ID.DATASET_ID.MODEL_NAME`
REMOTE WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (ENDPOINT = 'ENDPOINT');
Code example:
- Replace
'[PROJECT_ID.DATASET_ID.MODEL_NAME]'
with yourproject_id
,dataset_id
, andmodel name
. - Replace
'[PROJECT_ID.REGION.CONNECTION_ID]'
with yourproject_id
,region
, andconnection_id
.
CREATE OR REPLACE MODEL
`[PROJECT_ID.DATASET_ID.MODEL_NAME]`
REMOTE WITH CONNECTION `[PROJECT_ID.REGION.CONNECTION_ID]`
OPTIONS (ENDPOINT = 'gemini-pro');
2. Let’s Create a Sample of the Dataset
For this example, we would only work with a subset of the data. We will focus on release_note_type
for the product “Gemini” starting from the year 2023. Hence, we would create a table for this subset of the dataset.
We are also creating a new column called 'combined_description'
, where we are aggregating the value of the column description based on release_note_type
.
Replace '[PROJECT_ID.DATASET_ID.TABLE_NAME]'
with your project_id
, dataset_id
, and table_name
:
CREATE or REPLACE TABLE `[PROJECT_ID.DATASET_ID.TABLE_NAME]` AS
SELECT
release_note_type,
ARRAY_TO_STRING(ARRAY_AGG(description), "; ") AS combined_description
FROM
`bigquery-public-data.google_cloud_release_notes.release_notes`
WHERE published_at > '2023-01-01'
AND product_name = "Gemini"
GROUP BY
release_note_type;
SELECT * FROM `[PROJECT_ID.DATASET_ID.TABLE_NAME]`;
3. Generate Text
With just a few lines of SQL, we can analyze text or visual content in our BigQuery table using that model and the ML.GENERATE_TEXT
function.
The ML.GENERATE_TEXT
syntax differs depending on the Vertex AI model that your remote model targets. Read the documentation to understand all the parameters of the ML.GENERATE_TEXT
function.
Syntax:
ML.GENERATE_TEXT(
MODEL project_id.dataset.model,
{ TABLE project_id.dataset.table | (query_statement) },
STRUCT(
[max_output_tokens AS max_output_tokens]
[, top_k AS top_k]
[, top_p AS top_p]
[, temperature AS temperature]
[, flatten_json_output AS flatten_json_output]
[, stop_sequences AS stop_sequences])
)
Code example:
- Replace
'[PROJECT_ID.DATASET_ID.TABLE_NAME]'
with yourproject_id
anddataset_id
. - Replace
'[PROJECT_ID.DATASET_ID.MODEL_NAME]'
with yourproject_id
,dataset_id
, andmodel name
.
CREATE OR REPLACE TABLE `[PROJECT_ID.DATASET_ID.TABLE_NAME]` AS
WITH
PROMPT AS (
SELECT release_note_type, combined_description, CONCAT('You are a helpful AI assistant summarizing Google Cloud release notes. Your task is to provide a concise summary of the release notes ' , combined_description)
AS prompt
FROM
`[PROJECT_ID.DATASET_ID.TABLE_NAME]`
),
SUMMARY_GENERATION AS (
SELECT *
FROM
ML.GENERATE_TEXT(
MODEL `[PROJECT_ID.DATASET_ID.MODEL_NAME]`,
(SELECT * FROM PROMPT),
STRUCT(150 AS max_output_tokens,
0.25 AS temperature,
35 AS top_k,
1.0 AS top_p,
TRUE AS flatten_json_output))
)
SELECT release_note_type, combined_description, ml_generate_text_llm_result as review_summary, FROM SUMMARY_GENERATION;
SELECT * FROM `[PROJECT_ID.DATASET_ID.TABLE_NAME]`
4. Result
We instructed the model to act as an "AI assistant summarizing Google Cloud release notes". The model then processed these notes and generated a concise summary.
The ML.GENERATE_TEXT
function returns the input table plus the following columns:
- ml_generate_text_result: This is the JSON response, and the generated text is in the text element.
- ml_generate_text_llm_result: A STRING value that contains the generated text. This column is returned when
flatten_json_output
isTRUE
. - ml_generate_text_status: A STRING value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
- ml_generate_text_grounding_result: A STRING value that contains a list of the grounding sources that the model used to gather additional information.
Conclusion
Overall, we can use Gemini to do prompt engineering and analyze our BigQuery data row by row at scale. BigQuery's integration with the Gemini models through Vertex AI allows us to interact with Gemini using SQL and take advantage of the scale of BigQuery. It helps us analyze structured, semi-structured, and unstructured data to gain new insights and enable novel, analytical applications.
We can use Gemini for several tasks, including text summarization, questioning and answering, feature extraction, and sentiment analysis. It can also help generate content and enhance data. Furthermore, the Gemini family of models, with its advanced multimodality feature, can take input like a combination of text, images, or video. This opens doors for numerous use cases like object recognition, captioning, description, and digital content understanding.
Check out the ML.GENERATE_TEXT function to learn more about available models and how to generate text using the ML.GENERATE_TEXT function here.
Opinions expressed by DZone contributors are their own.
Comments