Dataframes With Big Data in Couchbase
In this article, we'll explore how users more comfortable with dataframes can work with their Big Data in Couchbase's Analytics Service.
If you have followed any tutorial on data science in Python, chances are that you have heard about a library known as Pandas. Pandas provides a dataframe interface for your data, which is a two-dimensional tabular data structure that "shares many properties with matrices" (in the words of `R`'s definition of dataframes). Dataframes have grown over the years to be an essential component of the modern data scientist's toolkit. In the words of Pandas creator Wes McKinney:
(pandas) is what people use for data ingest, data prep, and feature engineering for machine learning models. The existence of other database systems that perform equivalent tasks isn't useful if they are not accessible to Python programmers with a convenient API.
The latter part of McKinney's quote is (arguably) why Pandas and other similar models have been adopted by the data science community over more formally grounded models from the database community (e.g., relational tables). While it may be tempting to equate relational tables with dataframes, the two are fundamentally different. In Couchbase (and most other modern databases), data is conceptually represented as unordered (multi)sets of records. The "set of records" abstraction gives databases more freedom to execute queries efficiently and over data that is larger than memory. On the other hand, (R and Pandas) dataframes are more akin to matrices in that the order of rows and columns matters. Data scientists have grown to expect support for operations like transpose (flipping the dataframe over its diagonal) and index-based access (e.g., df.iat[1, 2]), both of which are difficult (or impossible) to express in languages like SQL. I'll point to "Is a Dataframe Just a Table?" by Yifan Wu for a deeper discussion of the two abstractions.
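To make the distinction concrete, here is a minimal Pandas sketch (using toy values, not the Yelp data) of two order-dependent operations, transpose and positional access, that have no direct analog over an unordered set of records:
import pandas as pd
# A tiny, order-sensitive dataframe (toy values for illustration only).
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
print(df.T)          # transpose: rows and columns swap, which presumes an ordering
print(df.iat[1, 2])  # positional access: 2nd row, 3rd column -> 8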
Regardless of how dataframes have found their way into the mainstream, many Python data scientists simply prefer performing their EDA (exploratory data analysis) with Pandas. In this article, we'll explore how users more comfortable with dataframes can work with their Big Data in Couchbase's Analytics Service. We'll use the Yelp Dataset for our discussion, where we have the following collections: businesses, reviews, and checkins. The setup script can be found here. We will cover how to prepare a set of features with:
- the Couchbase Python SDK and Pandas (the de facto Python dataframe standard);
- the Couchbase Python SDK and Modin (a scalable drop-in Pandas replacement);
- the Couchbase Python SDK and SQL++; and
- AFrame (an in-situ, no-ETL dataframe view for Couchbase collections).
Couchbase and Pandas
To start, let's assume that we want to predict whether a business is still "in business". More specifically, we want to predict the is_open flag of a business. For brevity, we omit the model training details in this article to focus on the feature engineering process. In the snippet below, we use the Couchbase Python SDK to access a local Couchbase cluster hosting an Analytics node. We use three simple SQL++ queries to load each dataset into three separate dataframes.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
import pandas as pd
auth = PasswordAuthenticator(username='admin', password='password')
cluster = Cluster('couchbase://127.0.0.1', ClusterOptions(auth))
result_to_df = lambda r: pd.DataFrame(list(cluster.analytics_query(r)))
businesses_df = result_to_df("FROM yelp.businesses b SELECT VALUE b")
checkins_df = result_to_df("FROM yelp.checkins c SELECT VALUE c")
# reviews_df = result_to_df("FROM yelp.reviews r SELECT VALUE r")
Note that the last line is commented out on purpose! (We will handle the reviews collection later.) There are approximately 7 million total reviews in our Yelp dataset, and loading all of them into memory is more than my laptop can handle. In general, McKinney suggests Pandas users should have 5-10x more RAM than the size of their dataset. Herein lies the major roadblock data scientists with Big Data must work around: the inability to elegantly reason about data that is larger than memory. Our data resides in Couchbase, and to reason about it in Pandas we need to copy it from Couchbase into our Python process memory. Even if we had a workstation large enough to hold our data, we must now deal with the problem of staleness. If new records arrive in Couchbase after we build our dataframes, these new records will not be used for the rest of our analysis (or, more optimistically, until we rebuild our dataframes).
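As a rough back-of-the-envelope check (not part of the original workflow), Pandas can report how much memory a dataframe occupies once loaded, which helps gauge whether a collection is safe to pull client-side under that 5-10x rule of thumb:
# Approximate in-memory footprint of a dataframe we have already loaded;
# deep=True also accounts for Python objects such as strings.
size_in_mib = businesses_df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"businesses_df occupies roughly {size_in_mib:.1f} MiB in memory")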
Having addressed the elephant in the room, let us now continue our feature engineering. The goal for each section (i.e., Couchbase and Pandas, Couchbase and Modin, Couchbase and SQL++, and Couchbase and AFrame) is to find some numeric and tabular representation of all businesses that we can subsequently use as input to an ML model. Suppose that we mess around and settle on the following numeric features:
- An n-hot encoding of the top 10 categories for a business.
- The total number of check-ins a business has.
- The total number of reviews a business has.
- The lowest star rating for a review about a given business.
- The highest star rating for a review about a given business.
- The average star rating for a review about a given business.
- The variance in star ratings for reviews about a given business.
- The kurtosis in star ratings for reviews about a given business.
We will start with item 1 of the list above. To find the top 10 categories shared across all businesses, we 1) convert the comma-separated categories column into a column of lists, 2) "explode/unnest" the categories column to generate a row per list entry, 3) count the distinct values of the categories column, and 4) extract the top 10 values into a set.
top_categories = businesses_df['categories'] \
.str.split(', ') \
.explode() \
.value_counts() \
.nlargest(10) \
.keys()
top_categories = set(top_categories)
top_categories
{'Active Life',
'American (New)',
'American (Traditional)',
'Arts & Entertainment',
'Auto Repair',
'Automotive',
'Bakeries',
'Bars',
'Beauty & Spas',
...
'Specialty Food'}
We are now ready to encode the existence of these top 10 categories as 1/0s. Using our businesses dataframe, we 1) convert the comma-separated categories column into a column of lists (again), 2) filter out all categories that are not in our previously found top 10 using a set intersection, 3) "join" the set of categories (row-wise) into a pipe-separated string, and 4) use the get_dummies() function to return a dataframe where each column corresponds to an item in the list from step 3.
ml_input_1 = businesses_df['categories']\
.str.split(', ')\
.apply(lambda a: None if a is None else '|'.join(set(a).intersection(top_categories)))\
.str.get_dummies()
ml_input_1.head()
Active Life American (New) American (Traditional) Arts & Entertainment ...
0 0 0 0 0 ...
1 0 0 0 0 ...
2 0 0 0 0 ...
3 0 0 0 1 ...
4 0 0 0 0 ...
[5 rows x 10 columns]
The next feature requires us to use the checkins_df dataframe. We 1) join our businesses_df dataframe with the checkins_df dataframe, 2) generate a list-valued column by splitting our comma-separated-string-valued field date, and 3) apply the len function to get the number of check-ins. For businesses that do not have a corresponding check-in record, we 4) set their value to 0. The (row-wise) order of ml_input_1 is identical to our result (i.e., they share the same "index"), so we can simply use the concat function along the horizontal axis (axis=1).
result = businesses_df \
.join(checkins_df.set_index('business_id'), on='business_id') \
['date'] \
.str.split(',') \
.map(lambda v: len(v) if type(v) is list else 0) \
.rename("Checkin Count")
ml_input_2 = pd.concat([ml_input_1, result], axis=1)
ml_input_2.head()
Active Life American (New) ... Skin Care Specialty Food Checkin Count
0 0 0 ... 0 0 146
1 0 0 ... 0 0 33
2 0 0 ... 0 0 19
3 0 0 ... 0 0 3
4 0 0 ... 0 0 0
[5 rows x 11 columns]
Our last set of features revolves around the distribution of star ratings for a business. We define a function below called summarize, which takes in a list of star ratings and returns various statistics using the scipy.stats.describe function.
import scipy.stats
def summarize(all_stars):
d = scipy.stats.describe(all_stars)
return {
'Review Count': d.nobs,
'Minimum Stars': d.minmax[0],
'Maximum Stars': d.minmax[1],
'Mean Stars': d.mean,
'Star Variance': d.variance,
'Star Skewness': d.skewness,
'Star Kurtosis': d.kurtosis
}
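As a quick sanity check, calling summarize on a small hand-made list of ratings (values chosen arbitrarily for illustration) returns a plain dictionary of the seven statistics:
summarize([5, 5, 4, 3, 1])
# {'Review Count': 5, 'Minimum Stars': 1, 'Maximum Stars': 5,
#  'Mean Stars': 3.6, 'Star Variance': 2.8, ...}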
As mentioned above, we cannot work with all reviews in memory. Instead, we will use the LIMIT and OFFSET clauses of SQL++ to work with chunks of 1 million reviews records. Note that this approach is not impervious to data changes on the Couchbase side. A more consistent (and performant) approach requires a duplicated reviews collection (i.e., a third copy of the reviews data). In the snippet below, we end up with a dataframe (stars_df) that is row-wise aligned with the businesses_df dataframe and possesses a list-valued column called stars.
import numpy as np
stars_df = businesses_df[['business_id']] \
.assign(stars=[np.nan for _ in range(len(businesses_df))])
working_offset = 0
chunk_size = 1000000
while True:
partial_df_1 = result_to_df(f"""
FROM
yelp.reviews r
SELECT
r.business_id,
r.stars
ORDER BY
r.business_id ASC
LIMIT {chunk_size}
OFFSET {working_offset}
""")
if len(partial_df_1) == 0:
break
working_offset += chunk_size
partial_df_2 = stars_df.loc[stars_df['stars'].notna()].explode('stars')
partial_df_3 = pd.concat([partial_df_1, partial_df_2], axis=0) \
.groupby(['business_id']) \
.agg(list)
stars_df = stars_df[['business_id']] \
.join(partial_df_3, on='business_id')
Now possessing the minimum amount of review information for a given business, our final step is to apply the summarize function and concat the result with our existing feature set...
partial_df_4 = pd.json_normalize(stars_df['stars'].apply(summarize))
ml_input_3 = pd.concat([ml_input_2, partial_df_4], axis=1)
ml_input_3.head()
Active Life American (New) American (Traditional) Arts & Entertainment \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 0
... Skin Care Specialty Food Checkin Count Review Count Minimum Stars \
0 ... 0 0 146 70 1.0
1 ... 0 0 33 5 1.0
2 ... 0 0 19 422 1.0
3 ... 0 0 3 8 3.0
4 ... 0 0 0 9 1.0
Maximum Stars Mean Stars Star Variance Star Skewness Star Kurtosis
0 5.0 3.342857 1.619876 -0.494108 -0.624766
1 5.0 2.800000 3.200000 0.035156 -1.581055
2 5.0 4.760664 0.747808 -3.799428 13.180355
3 5.0 4.750000 0.500000 -2.267787 3.142857
4 5.0 3.111111 4.111111 -0.158662 -1.910062
[5 rows x 18 columns]
...and now we are free to use the ml_input_3 dataframe directly as training input for some model in a machine learning library (e.g., scikit-learn)! For completeness, here's a snippet of using our feature set to train a decision tree classifier:
import sklearn.model_selection
import sklearn.tree
x, y = ml_input_3, businesses_df['is_open']
x_train, x_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(x, y, test_size=0.3, random_state=1)
clf = sklearn.tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
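A natural follow-up (not shown in the original article) is to score the fitted tree on the held-out 30% split; note that any NaN feature values (e.g., businesses with no reviews) would need to be imputed or dropped before fitting:
# Mean accuracy of the fitted classifier on the held-out test split.
print("test accuracy:", clf.score(x_test, y_test))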
Couchbase and Modin
The Pandas feature engineering process works best if we are able to fit our data into memory. If we are unable to fit our data into memory (e.g., the reviews dataset), we need to devise specialized solutions to consider our data in chunks. The next three sections delve into cleaner alternatives that do not require us (the data scientist) to consider how large our data is for in-core and out-of-core workloads.
The first approach (and the easiest to integrate for existing Pandas users) is a library called Modin. Modin is a drop-in replacement for Pandas that translates Pandas operations into computations executed on a distributed runtime engine (i.e., Ray, Dask, or MPI/Unidist). We will start with three simple steps: 1) initializing our Ray backend, 2) replacing the import pandas as pd line in our first code snippet above with import modin.pandas as pd, and 3) uncommenting the line that loads our reviews_df dataframe:
import ray
ray.init()
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
import modin.pandas as pd
auth = PasswordAuthenticator(username='admin', password='password')
cluster = Cluster('couchbase://127.0.0.1', ClusterOptions(auth))
result_to_df = lambda r: pd.DataFrame(list(cluster.analytics_query(r)))
businesses_df = result_to_df("FROM yelp.businesses b SELECT VALUE b")
checkins_df = result_to_df("FROM yelp.checkins c SELECT VALUE c")
reviews_df = result_to_df("FROM yelp.reviews r SELECT VALUE r")
The snippet above moves the data a) out of Couchbase into b) our Python process memory, and then from our Python process memory into c) Modin's Ray backend. As we will see, the subsequent operations are more performant (thanks to Modin's task parallelism) at the cost of an even more expensive loading step. Furthermore, while out-of-core operations are possible in Modin, Modin delegates this spillage to the operating system (i.e., a workload-unaware component). The next two sections will address how we can simply operate on our data within Couchbase (in-situ) using Couchbase's built-in execution engine, which is purpose-built for in-core and out-of-core distributed workloads.
Having addressed the "new elephant in the second room," let us continue with our feature engineering. For all non-review-related features, our code remains nearly the same (there is a bug in the get_dummies() function for Modin, so we defer the computation of the ml_input_1 dataframe to Pandas). While we could keep our hand-crafted solution that chunks reviews (Modin is meant to be a drop-in replacement), we will showcase how Modin allows us to work out-of-core with large dataframes like reviews_df. In the snippet below, we 1) aggregate all review stars by their business_id values into a list-valued column, 2) apply our summarize function, 3) convert the application of our summarize function into 7 separate columns, 4) join this result (review_summary_df) with our original businesses_df dataframe, and 5) add these 7 new columns to our ml_input_2 dataframe to produce our final ml_input_3.
reviews_agg_df = reviews_df[['business_id', 'stars']] \
.groupby(['business_id']) \
.agg(list) \
.reset_index()
review_stars_df = pd.json_normalize(
reviews_agg_df \
['stars'] \
.apply(summarize)
)
review_summary_df = pd.concat([reviews_agg_df, review_stars_df], axis=1) \
.drop(columns=['stars']) \
.set_index('business_id')
business_summary_df = businesses_df[['business_id']] \
.join(review_summary_df, on='business_id')
ml_input_3 = pd.concat([ml_input_2, business_summary_df], axis=1)
ml_input_3.head()
Active Life American (New) American (Traditional) Arts & Entertainment \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 0
... Skin Care Specialty Food Checkin Count Review Count Minimum Stars \
0 ... 0 0 146 70 1.0
1 ... 0 0 33 5 1.0
2 ... 0 0 19 422 1.0
3 ... 0 0 3 8 3.0
4 ... 0 0 0 9 1.0
Maximum Stars Mean Stars Star Variance Star Skewness Star Kurtosis
0 5.0 3.342857 1.619876 -0.494108 -0.624766
1 5.0 2.800000 3.200000 0.035156 -1.581055
2 5.0 4.760664 0.747808 -3.799428 13.180355
3 5.0 4.750000 0.500000 -2.267787 3.142857
4 5.0 3.111111 4.111111 -0.158662 -1.910062
[5 rows x 18 columns]
The snippet above is much more elegant than our hand-crafted solution from the previous section. Data scientists are no longer bound by their data size, and data engineers can scale up a data scientist's feature extraction code by configuring the Modin backend. Modin even offers (experimental) support for instantiating Ray clusters on AWS.
The impact of data transfer between i) Couchbase, ii) Python, and iii) the Modin backend (Ray) should not be underestimated. The snippet above runs slower than our handcrafted approach because Modin (Ray) will naively keep the large reviews_df dataframe in its memory. If we selectively choose which review fields to pull from Couchbase...
reviews_df = result_to_df("FROM yelp.reviews r SELECT r.business_id, r.stars")
...we observe a ~3x speedup in feature extraction time (3min to 1min), but we are no longer working with just the dataframe abstraction.
Couchbase and SQL++
The previous two sections (more or less) cover how data scientists operate on their data using dataframes. In the first section, we illustrated how to work with Big Data in Pandas by hand-writing a chunking process ourselves. In the second section, we illustrated how we can use Modin as a drop-in replacement for Pandas to lower the barrier required for data scientists working with Big Data. In this section, we will show how data scientists can work with their Couchbase data in-situ using SQL++ instead of dataframes.
As a reminder, our goal is to find some numeric and tabular representation of all businesses that we can subsequently use as input to an ML model. In the previous two sections, this representation was a dataframe. In this section, our goal is to build a SQL++ query that returns a Python list of numeric-valued iterables. First, let us find the top 10 categories across all businesses. The query below a) iterates over all businesses, b) splits the comma-separated string-valued field categories into an array, c) UNNESTs (a SQL++ operation equivalent to the dataframe operation explode) the array we just found, d) uses a GROUP BY clause on the split categories and the aggregate function COUNT(*) to count the number of instances per category, and e) uses the ORDER BY and LIMIT clauses to return the top 10 group keys (the categories).
results = cluster.analytics_query("""
FROM
yelp.businesses b,
SPLIT(b.categories, ",") c
LET
clean_c = TRIM(c)
GROUP BY
clean_c
SELECT VALUE
clean_c
ORDER BY
COUNT(*) DESC
LIMIT 10
""")
top_categories = list(results)
top_categories
['Restaurants',
'Food',
'Shopping',
'Home Services',
'Beauty & Spas',
'Nightlife',
'Health & Medical',
'Local Services',
...
'Skin Care']
Using the categories we just found, we will assemble a list of SELECT clause projections that return 1 if a category is found and 0 otherwise. Below, our query result list(results) is equivalent (as a list of iterables) to the ml_input_1 dataframe from the previous two sections.
results = cluster.analytics_query(f"""
FROM
yelp.businesses b
SELECT
{','.join(f'TO_BIGINT(CONTAINS(b.categories, "{x}")) AS `{x}`' for x in top_categories)}
""")
list(results)[0]
{
"Active Life": 0,
...
"Shopping": 1,
"Skin Care": 0,
"Specialty Food": 0
}
We will now build upon the SQL++ query above and add a subquery to compute the total number of check-ins. In the subquery below, we a) iterate over the check-ins collection, b) split and UNNEST the comma-separated-string-valued field date, and c) correlate the subquery in the WHERE clause via the conjunct c.business_id = b.business_id. In contrast to standard SQL, SQL++ subqueries always return a multiset of records (rather than a scalar). Consequently, we wrap our subquery in the SQL++ function ARRAY_COUNT to count the number of check-ins a business has. Our query result list(results) at this stage is equivalent (as a list of iterables) to the ml_input_2 dataframe from the previous two sections.
results = cluster.analytics_query(f"""
FROM
yelp.businesses b
SELECT
{','.join(f'TO_BIGINT(CONTAINS(b.categories, "{x}")) AS `{x}`' for x in top_categories)},
ARRAY_COUNT((
FROM
yelp.checkins c,
SPLIT(c.date, ",") d
WHERE
c.business_id = b.business_id
SELECT
1
)) AS `Checkin Count`
""")
list(results)[0]
{
...
"Caterers": 0,
"Checkin Count": 146,
"Chicken Wings": 0,
...
}
The last seven features we are interested in pertain to statistics that were computed with SciPy, a Python library. Couchbase users can leverage Python UDFs for Analytics (currently in developer preview) to execute Python code over their data using SQL++, but here we will write a UDF in SQL++ that replicates the functionality of our summarize function above.
CREATE ANALYTICS FUNCTION yelp.summarize(all_stars) {
{
"Review Count": ARRAY_COUNT(all_stars),
"Minimum Stars": ARRAY_MIN(all_stars),
"Maximum Stars": ARRAY_MAX(all_stars),
"Mean Stars": ARRAY_AVG(all_stars),
"Star Variance": ARRAY_VAR_POP(all_stars),
"Star Skewness": ARRAY_SKEWNESS(all_stars),
"Star Kurtosis": ARRAY_KURTOSIS(all_stars)
}
};
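The statement above is plain SQL++ DDL, so (assuming the Analytics service accepts DDL through the same query call the SDK has been using; the Analytics Workbench works equally well) it can be registered once from Python before running the final feature query:
# Register the SQL++ UDF; the body is the CREATE ANALYTICS FUNCTION statement above.
ddl = """
CREATE ANALYTICS FUNCTION yelp.summarize(all_stars) {
  {
    "Review Count": ARRAY_COUNT(all_stars),
    "Minimum Stars": ARRAY_MIN(all_stars),
    "Maximum Stars": ARRAY_MAX(all_stars),
    "Mean Stars": ARRAY_AVG(all_stars),
    "Star Variance": ARRAY_VAR_POP(all_stars),
    "Star Skewness": ARRAY_SKEWNESS(all_stars),
    "Star Kurtosis": ARRAY_KURTOSIS(all_stars)
  }
};
"""
list(cluster.analytics_query(ddl))  # iterate to force execution; DDL returns no rows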
Finally, we will add to our previous SQL++ query to finish our feature-extracting query. We again start with a correlated subquery that JOINs reviews and businesses to return all star ratings associated with a business. Per business, the results of this subquery are (conceptually) given to the yelp.summarize call. To "promote" the results of the summarize calls to the main SELECT, we leverage the v.* feature of SQL++ SELECT clause projections. Our query result list(results) is equivalent (as a list of iterables) to the ml_input_3 dataframe from the previous two sections, and it is ready to be used directly as training input for some model in a machine learning library (e.g., scikit-learn):
results = cluster.analytics_query(f"""
FROM
yelp.businesses b
LET
review_features = yelp.summarize((
FROM
yelp.reviews r
WHERE
r.business_id = b.business_id
SELECT VALUE
r.stars
))
SELECT
{','.join(f'TO_BIGINT(CONTAINS(b.categories, "{x}")) AS `{x}`' for x in top_categories)},
ARRAY_COUNT((
FROM
yelp.checkins c,
SPLIT(c.date, ",") d
WHERE
c.business_id = b.business_id
SELECT
1
)) AS `Checkin Count`,
review_features.*
""")
list(results)[0]
{
"Review Count": 70,
"Mean Stars": 3.342857142857143,
"Star Variance": 1.596734693877551,
"Star Skewness": -0.49410799997654287,
"Star Kurtosis": -0.624766331689814,
"Minimum Stars": 1,
"Maximum Stars": 5,
"Restaurants": 1,
"Food": 0,
"Shopping": 0,
...
"Checkin Count": 146
}
Compared to our previous two sections, this feature collection process runs in roughly 40 seconds, as opposed to the 1-minute execution time with Modin and the 3-minute execution time with Pandas.
Couchbase and AFrame
The approach above works best for users who prefer the SQL++ abstraction to the dataframe abstraction. All the computation is performed in-situ, and there is no expensive loading step into Python. For users who prefer the dataframe abstraction but still want to work with their data in-situ, AFrame is a viable alternative to hand-writing SQL++. In a nutshell, AFrame is a Python library that provides a dataframe syntax for collections in Couchbase Analytics. Behind the scenes, AFrame translates dataframe operations into SQL++ queries that are evaluated lazily, essentially deferring all dataframe operations to Couchbase Analytics itself. While not a drop-in replacement like Modin, AFrame gives data scientists a Pandas-esque API with the in-situ execution of the previous section. To start, we will clone the AFrame repository and install AFrame using pip:
git clone https://github.com/psinthong/AFrame.git
cd AFrame
pip install .
Once installed, we will import AFrame and define a dataframe for each of our three collections using the CBAnalyticsConnector connector.
from aframe import AFrame
from aframe.connector import CBAnalyticsConnector
connector = CBAnalyticsConnector('http://localhost:8095', 'admin', 'password')
businesses_df = AFrame(dataverse='yelp', dataset='businesses', connector=connector)
checkins_df = AFrame(dataverse='yelp', dataset='checkins', connector=connector)
reviews_df = AFrame(dataverse='yelp', dataset='reviews', connector=connector)
We are now ready to define some features! First, let us find the top 10 categories using our businesses_df dataframe. In the snippet below, we 1) convert the comma-separated categories column into a column of lists, 2) "explode/unnest" the categories column to generate a row per list entry, 3) count the distinct values of the categories column, and 4) extract the top 10 values into a set.
top_categories = businesses_df['categories']
top_categories['categories'] = top_categories \
.map('split', ', ')
top_categories = top_categories \
.explode('categories') \
.value_counts('categories') \
.nlargest(10, 'count') \
['categories']
top_categories = set(top_categories)
top_categories
We are now ready to (again) encode the existence of these top 10 categories as 1/0s. The approach we will use with AFrame involves using a for loop to define a column for each category. The Analytics functions used here are contains and to_bigint. The ml_input_1 dataframe below (or, more accurately, ml_input_1.drop('business_id').toPandas()) is equivalent to the ml_input_1 dataframes from the Pandas and Modin sections:
ml_input_1 = businesses_df[['business_id', 'categories']]
for category in top_categories:
ml_input_1[category] = ml_input_1['categories'] \
.map('contains', category) \
.map('to_bigint')
ml_input_1 = ml_input_1.drop('categories')
ml_input_1.head()
Behind the scenes, AFrame has assembled the following query to execute on Couchbase Analytics. Thanks to the composability of SQL++, AFrame (and other tools built on top of Couchbase) can define deeply nested queries like the snippet below.
SELECT VALUE OBJECT_REMOVE(t, 'categories')
FROM (
SELECT t.*, to_bigint(contains(categories, "Home Services")) AS `Home Services`
FROM (
SELECT t.*, to_bigint(contains(categories, "Bars")) AS `Bars`
FROM ...
.
.
.
) t
) t
In the snippet below, we 1) count the number of check-ins per business and 2) merge/join our results into the ml_input_1 dataframe from before to produce a new ml_input_2 dataframe. Again, the ml_input_2 dataframe below is equivalent to the ml_input_2 dataframes found in the first two sections.
result = checkins_df
result['date'] = result \
['date'] \
.map('split', ', ') \
.map('array_count')
ml_input_2 = ml_input_1 \
.merge(result, left_on='business_id', right_on='business_id', how='left')
ml_input_2.head()
To conclude, we will define our remaining seven features involving the reviews_df dataframe. In the snippet below, we use AFrame grouping and several Analytics aggregate functions that cover the statistics provided by yelp.summarize above.
result = reviews_df[['business_id', 'stars']] \
.groupby('business_id') \
.agg({'stars': ['count', 'min', 'max', 'mean', 'var', 'skewness', 'kurtosis']}) \
.rename({'count_stars': 'Review Count',
'min_stars': 'Minimum Stars',
'max_stars': 'Maximum Stars',
'avg_stars': 'Mean Stars',
'var_stars': 'Star Variance',
'skewness_stars': 'Star Skewness',
'kurtosis_stars': 'Star Kurtosis'})
ml_input_3 = ml_input_2 \
.merge(result, left_on='business_id', right_on='business_id', how='left') \
.drop('business_id')
ml_input_3.head()
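Since the features ultimately need to land in an ML library, one option (mirroring the toPandas() call mentioned above for ml_input_1, and assuming it materializes the query result client-side) is to pull down only the final, aggregated feature rows:
# Materialize the final feature set locally; at this point it is one row per
# business rather than the raw ~7M reviews, so it fits comfortably in memory.
ml_input_3_pd = ml_input_3.toPandas()
ml_input_3_pd.head()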
AFrame was a research project that (unfortunately) lacked the commercial support given to projects like Modin. Consequently, the feature extraction here takes a little under 2 minutes. While the generated SQL++ query is equivalent (or near-equivalent) to the handcrafted SQL++ query in the previous section, most databases (including Couchbase Analytics) have historically had trouble optimizing nested queries (see work from Elliott, Cheng, Thomas-Ogbuji, and Ozoyoglu here for research in the context of SPARQL to SQL)...
...that's not to say that AFrame should be abandoned though — the exercise in this section has shown that AFrame is a very capable "dataframe to SQL++ query" generator. The generated SQL++ queries can then be used to guide the authoring of cleaner SQL++ queries that are executed using the Couchbase Python SDK. While not as clean as Modin, this workflow (AFrame + Couchbase SDK) gives data scientists the resource efficiency of the previous section and the dataframe user model of the first two sections.
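As a rough sketch of that workflow, the query text below is a hypothetical hand-cleaned fragment (standing in for whatever you distill from AFrame's generated SQL++), executed with the same result_to_df helper from the first section:
# Hypothetical hand-cleaned SQL++ derived from AFrame's generated query,
# executed via the Couchbase Python SDK and landed directly in Pandas.
cleaned_query = """
FROM yelp.businesses b
SELECT TO_BIGINT(CONTAINS(b.categories, "Restaurants")) AS `Restaurants`,
       TO_BIGINT(CONTAINS(b.categories, "Food")) AS `Food`
"""
features_df = result_to_df(cleaned_query)  # result_to_df is defined in the Pandas section
features_df.head()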
Conclusion
It seems we are still in need of that "silver bullet": an API that gives us efficient out-of-core execution with a dataframe user model. In this article, we looked at four different approaches for generating features using data stored in Couchbase Analytics: 1) Pandas, 2) Modin, 3) SQL++, and 4) AFrame.
- Pandas, the de facto dataframe standard for Python, has massive adoption by the data science community but struggles with Big Data;
- Modin, a (near) drop-in replacement for Pandas that allows users to scale their Pandas workflows at the cost of a (potentially) expensive loading step;
- SQL++, a non-dataframe user model that gives users the ability to express in-situ efficient execution of Couchbase data; and
- AFrame, a dataframe wrapper that translates dataframe operations into SQL++ queries.