Dataframes With Big Data in Couchbase
In this article, we'll explore how users more comfortable with dataframes can work with their Big Data in Couchbase's Analytics Service.
If you have followed any tutorial on data science in Python, chances are that you have heard about a library known as Pandas. Pandas provides a dataframe interface for your data, which is a two-dimensional tabular data structure that "shares many properties with matrices" (in the words of `R`'s definition of dataframes). Dataframes have grown over the years to be an essential component of the modern data scientist's toolkit. In the words of Pandas creator Wes McKinney:
(pandas) is what people use for data ingest, data prep, and feature engineering for machine learning models. The existence of other database systems that perform equivalent tasks isn't useful if they are not accessible to Python programmers with a convenient API.
The latter part of McKinney's quote is (arguably) why Pandas and other similar models have been adopted by the data science community over more formally grounded models from the database community (e.g., relational tables). While it may be tempting to equate relational tables with dataframes, the two are fundamentally different. In Couchbase (and most other modern databases), data is conceptually represented as unordered (multi)sets of records. The "set of records" abstraction gives databases more freedom to execute queries efficiently and over data that is larger than memory. On the other hand, (R and Pandas) dataframes are more akin to matrices in that the order of rows and columns matters. Data scientists have grown to expect support for operations like transpose (flipping the dataframe over its diagonal) and index-based access (e.g., df.iat[1, 2]), both of which are difficult (or impossible) to express in languages like SQL. I'll point to "Is a Dataframe Just a Table?" by Yifan Wu for a deeper discussion of the two abstractions.
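To make the distinction concrete, here is a minimal Pandas sketch (using toy values, not the Yelp data) of two order-dependent operations, transpose and positional access, that have no direct analog over an unordered set of records:
import pandas as pd
# A tiny, order-sensitive dataframe (toy values for illustration only).
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
print(df.T)          # transpose: rows and columns swap, which presumes an ordering
print(df.iat[1, 2])  # positional access: 2nd row, 3rd column -> 8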
Regardless of how dataframes have found their way into the mainstream, many Python data scientists simply prefer performing their EDA (exploratory data analysis) with Pandas. In this article, we'll explore how users more comfortable with dataframes can work with their Big Data in Couchbase's Analytics Service. We'll use the Yelp Dataset for our discussion, where we have the following collections: businesses, reviews, and checkins. The setup script can be found here. We will cover how to prepare a set of features with:
- the Couchbase Python SDK and Pandas (the de facto Python dataframe standard);
- the Couchbase Python SDK and Modin (a scalable drop-in Pandas replacement);
- the Couchbase Python SDK and SQL++; and
- AFrame (an in-situ, no-ETL dataframe view for Couchbase collections).
Couchbase and Pandas
To start, let's assume that we want to predict whether a business is still "in business". More specifically, we want to predict the is_open flag of a business. For brevity, we omit the model training details in this article to focus on the feature engineering process. In the snippet below, we use the Couchbase Python SDK to access a local Couchbase cluster hosting an Analytics node. We use three simple SQL++ queries to load each dataset into three separate dataframes.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
import pandas as pd
auth = PasswordAuthenticator(username='admin', password='password')
cluster = Cluster('couchbase://127.0.0.1', ClusterOptions(auth))
result_to_df = lambda r: pd.DataFrame(list(cluster.analytics_query(r)))
businesses_df = result_to_df("FROM yelp.businesses b SELECT VALUE b")
checkins_df = result_to_df("FROM yelp.checkins c SELECT VALUE c")
# reviews_df = result_to_df("FROM yelp.reviews r SELECT VALUE r")
Note that the last line is commented out on purpose! (We will handle the reviews collection later.) There are approximately 7 million total reviews in our Yelp dataset, and loading all of them into memory is more than my laptop can handle. In general, McKinney suggests Pandas users should have 5-10x more RAM than the size of their dataset. Herein lies the major roadblock data scientists with Big Data must work around: the inability to elegantly reason about data that is larger than memory. Our data resides in Couchbase, and to reason about it in Pandas we need to copy it from Couchbase into our Python process memory. Even if we had a workstation large enough to hold our data, we must now deal with the problem of staleness. If new records arrive in Couchbase after we build our dataframes, these new records will not be used for the rest of our analysis (or, more optimistically, until we rebuild our dataframes).
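As a rough back-of-the-envelope check (not part of the original workflow), Pandas can report how much memory a dataframe occupies once loaded, which helps gauge whether a collection is safe to pull client-side under that 5-10x rule of thumb:
# Approximate in-memory footprint of a dataframe we have already loaded;
# deep=True also accounts for Python objects such as strings.
size_in_mib = businesses_df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"businesses_df occupies roughly {size_in_mib:.1f} MiB in memory")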
Having addressed the elephant in the room, let us now continue our feature engineering. The goal for each section (i.e., Couchbase and Pandas, Couchbase and Modin, Couchbase and SQL++, and Couchbase and AFrame) is to find some numeric and tabular representation of all businesses that we can subsequently use as input to an ML model. Suppose that we mess around and settle on the following numeric features:
- An n-hot encoding of the top 10 categories for a business.
- The total number of check-ins a business has.
- The total number of reviews a business has.
- The lowest star rating for a review about a given business.
- The highest star rating for a review about a given business.
- The average star rating for a review about a given business.
- The variance in star ratings for reviews about a given business.
- The kurtosis in star ratings for reviews about a given business.
We will start with item 1 of the list above. To find the top 10 categories shared across all businesses, we 1) convert the comma-separated categories column into a column of lists, 2) "explode/unnest" the categories column to generate a row per list entry, 3) count the distinct values of the categories column, and 4) extract the top 10 values into a set.
top_categories = businesses_df['categories'] \
.str.split(', ') \
.explode() \
.value_counts() \
.nlargest(10) \
.keys()
top_categories = set(top_categories)
top_categories
{'Active Life',
'American (New)',
'American (Traditional)',
'Arts & Entertainment',
'Auto Repair',
'Automotive',
'Bakeries',
'Bars',
'Beauty & Spas',
...
'Specialty Food'}
We are now ready to encode the existence of these top 10 categories as 1/0s. Using our businesses dataframe, we 1) convert the comma-separated categories column into a column of lists (again), 2) filter out all categories that are not in our previously found top 10 using a set intersection, 3) "join" the set of categories (row-wise) into a pipe-separated string, and 4) use the get_dummies() function to return a dataframe where each column corresponds to an item in the list from step 3.
ml_input_1 = businesses_df['categories']\
.str.split(', ')\
.apply(lambda a: None if a is None else '|'.join(set(a).intersection(top_categories)))\
.str.get_dummies()
ml_input_1.head()
Active Life American (New) American (Traditional) Arts & Entertainment ...
0 0 0 0 0 ...
1 0 0 0 0 ...
2 0 0 0 0 ...
3 0 0 0 1 ...
4 0 0 0 0 ...
[5 rows x 10 columns]
The next feature requires us to use the checkins_df dataframe. We 1) join our businesses_df dataframe with the checkins_df dataframe, 2) generate a list-valued column by splitting our comma-separated-string-valued field date, and 3) apply the len function to get the number of check-ins. For businesses that do not have a corresponding check-in record, we 4) set their value to 0. The (row-wise) order of ml_input_1 is identical to our result (i.e., they share the same "index"), so we can simply use the concat function along the horizontal axis (axis=1).
result = businesses_df \
.join(checkins_df.set_index('business_id'), on='business_id') \
['date'] \
.str.split(',') \
.map(lambda v: len(v) if type(v) is list else 0) \
.rename("Checkin Count")
ml_input_2 = pd.concat([ml_input_1, result], axis=1)
ml_input_2.head()
Active Life American (New) ... Skin Care Specialty Food Checkin Count
0 0 0 ... 0 0 146
1 0 0 ... 0 0 33
2 0 0 ... 0 0 19
3 0 0 ... 0 0 3
4 0 0 ... 0 0 0
[5 rows x 11 columns]
Our last set of features revolves around the distribution of star ratings for a business. We define a function below called summarize, which takes in a list of star ratings and returns various statistics using the scipy.stats.describe function.
import scipy.stats
def summarize(all_stars):
d = scipy.stats.describe(all_stars)
return {
'Review Count': d.nobs,
'Minimum Stars': d.minmax[0],
'Maximum Stars': d.minmax[1],
'Mean Stars': d.mean,
'Star Variance': d.variance,
'Star Skewness': d.skewness,
'Star Kurtosis': d.kurtosis
}
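As a quick sanity check, calling summarize on a small hand-made list of ratings (values chosen arbitrarily for illustration) returns a plain dictionary of the seven statistics:
summarize([5, 5, 4, 3, 1])
# {'Review Count': 5, 'Minimum Stars': 1, 'Maximum Stars': 5,
#  'Mean Stars': 3.6, 'Star Variance': 2.8, ...}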
As mentioned above, we cannot work with all reviews in memory. Instead, we will use the LIMIT and OFFSET clauses of SQL++ to work with chunks of 1 million reviews records. Note that this approach is not impervious to data changes on the Couchbase side. A more consistent (and performant) approach requires a duplicated reviews collection (i.e., a third copy of the reviews data). In the snippet below, we end up with a dataframe (stars_df) that is row-wise aligned with the businesses_df dataframe and possesses a list-valued column called stars.
import numpy as np
stars_df = businesses_df[['business_id']] \
.assign(stars=[np.nan for _ in range(len(businesses_df))])
working_offset = 0
chunk_size = 1000000
while True:
partial_df_1 = result_to_df(f"""
FROM
yelp.reviews r
SELECT
r.business_id,
r.stars
ORDER BY
r.business_id ASC
LIMIT {chunk_size}
OFFSET {working_offset}
""")
if len(partial_df_1) == 0:
break
working_offset += chunk_size
partial_df_2 = stars_df.loc[stars_df['stars'].notna()].explode('stars')
partial_df_3 = pd.concat([partial_df_1, partial_df_2], axis=0) \
.groupby(['business_id']) \
.agg(list)
stars_df = stars_df[['business_id']] \
.join(partial_df_3, on='business_id')
Now possessing the minimum amount of review information for a given business, our final step is to apply the summarize function and concat the result with our existing feature set...
partial_df_4 = pd.json_normalize(stars_df['stars'].apply(summarize))
ml_input_3 = pd.concat([ml_input_2, partial_df_4], axis=1)
ml_input_3.head()
Active Life American (New) American (Traditional) Arts & Entertainment \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 0
... Skin Care Specialty Food Checkin Count Review Count Minimum Stars \
0 ... 0 0 146 70 1.0
1 ... 0 0 33 5 1.0
2 ... 0 0 19 422 1.0
3 ... 0 0 3 8 3.0
4 ... 0 0 0 9 1.0
Maximum Stars Mean Stars Star Variance Star Skewness Star Kurtosis
0 5.0 3.342857 1.619876 -0.494108 -0.624766
1 5.0 2.800000 3.200000 0.035156 -1.581055
2 5.0 4.760664 0.747808 -3.799428 13.180355
3 5.0 4.750000 0.500000 -2.267787 3.142857
4 5.0 3.111111 4.111111 -0.158662 -1.910062
[5 rows x 18 columns]
...and now we are free to use the ml_input_3 dataframe directly as training input for some model in a machine learning library (e.g., scikit-learn)! For completeness, here's a snippet of using our feature set to train a decision tree classifier:
import sklearn.model_selection
import sklearn.tree
x, y = ml_input_3, businesses_df['is_open']
x_train, x_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(x, y, test_size=0.3, random_state=1)
clf = sklearn.tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
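A natural follow-up (not shown in the original article) is to score the fitted tree on the held-out 30% split; note that any NaN feature values (e.g., businesses with no reviews) would need to be imputed or dropped before fitting:
# Mean accuracy of the fitted classifier on the held-out test split.
print("test accuracy:", clf.score(x_test, y_test))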
Couchbase and Modin
The Pandas feature engineering process works best if we are able to fit our data into memory. If we are unable to fit our data into memory (e.g., the reviews dataset), we need to devise specialized solutions to consider our data in chunks. The next three sections delve into cleaner alternatives that do not require us (the data scientist) to consider how large our data is for in-core and out-of-core workloads.
The first approach (and the easiest to integrate for existing Pandas users) is a library called Modin. Modin is a drop-in replacement for Pandas that translates Pandas operations into computations executed on a distributed runtime engine (i.e., Ray, Dask, or MPI/Unidist). We will start with three simple steps: 1) initializing our Ray backend, 2) replacing the import pandas as pd line in our first code snippet above with import modin.pandas as pd, and 3) uncommenting the line that loads our reviews_df dataframe:
import ray
ray.init()
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
import modin.pandas as pd
auth = PasswordAuthenticator(username='admin', password='password')
cluster = Cluster('couchbase://127.0.0.1', ClusterOptions(auth))
result_to_df = lambda r: pd.DataFrame(list(cluster.analytics_query(r)))
businesses_df = result_to_df("FROM yelp.businesses b SELECT VALUE b")
checkins_df = result_to_df("FROM yelp.checkins c SELECT VALUE c")
reviews_df = result_to_df("FROM yelp.reviews r SELECT VALUE r")
The snippet above moves the data a) out of Couchbase into b) our Python process memory, and then from our Python process memory into c) Modin's Ray backend. As we will see, the subsequent operations are more performant (thanks to Modin's task parallelism) at the cost of an even more expensive loading step. Furthermore, while out-of-core operations are possible in Modin, Modin delegates this spillage to the operating system (i.e., a workload-unaware component). The next two sections will address how we can simply operate on our data within Couchbase (in-situ) using Couchbase's built-in execution engine, which is purpose-built for in-core and out-of-core distributed workloads.
Having addressed the "new elephant in the second room," let us continue with our feature engineering. For all non-review-related features, our code remains nearly the same (there is a bug in the get_dummies() function for Modin, so we defer the computation of the ml_input_1 dataframe to Pandas). While we could keep our hand-crafted solution that chunks reviews (Modin is meant to be a drop-in replacement), we will showcase how Modin allows us to work out-of-core with large dataframes like reviews_df. In the snippet below, we 1) aggregate all review stars by their business_id values into a list-valued column, 2) apply our summarize function, 3) convert the application of our summarize function into 7 separate columns, 4) join this result (review_summary_df) with our original businesses_df dataframe, and 5) add these 7 new columns to our ml_input_2 dataframe to produce our final ml_input_3.
reviews_agg_df = reviews_df[['business_id', 'stars']] \
.groupby(['business_id']) \
.agg(list) \
.reset_index()
review_stars_df = pd.json_normalize(
reviews_agg_df \
['stars'] \
.apply(summarize)
)
review_summary_df = pd.concat([reviews_agg_df, review_stars_df], axis=1) \
.drop(columns=['stars']) \
.set_index('business_id')
business_summary_df = businesses_df[['business_id']] \
.join(review_summary_df, on='business_id')
ml_input_3 = pd.concat([ml_input_2, business_summary_df], axis=1)
ml_input_3.head()
Active Life American (New) American (Traditional) Arts & Entertainment \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 0
... Skin Care Specialty Food Checkin Count Review Count Minimum Stars \
0 ... 0 0 146 70 1.0
1 ... 0 0 33 5 1.0
2 ... 0 0 19 422 1.0
3 ... 0 0 3 8 3.0
4 ... 0 0 0 9 1.0
Maximum Stars Mean Stars Star Variance Star Skewness Star Kurtosis
0 5.0 3.342857 1.619876 -0.494108 -0.624766
1 5.0 2.800000 3.200000 0.035156 -1.581055
2 5.0 4.760664 0.747808 -3.799428 13.180355
3 5.0 4.750000 0.500000 -2.267787 3.142857
4 5.0 3.111111 4.111111 -0.158662 -1.910062
[5 rows x 18 columns]
The snippet above is much more elegant than our hand-crafted solution from the previous section. Data scientists are no longer bound by their data size, and data engineers can scale up a data scientist's feature extraction code by configuring the Modin backend. Modin even offers (experimental) support for instantiating Ray clusters on AWS.
The impact of data transfer between i) Couchbase, ii) Python, and iii) the Modin backend (Ray) should not be underestimated. The snippet above runs slower than our handcrafted approach because Modin (Ray) will naively keep the large reviews_df dataframe in its memory. If we selectively choose which review fields to pull from Couchbase...
reviews_df = result_to_df("FROM yelp.reviews r SELECT r.business_id, r.stars")
...we observe a ~3x speedup in feature extraction time (3min to 1min), but we are no longer working with just the dataframe abstraction.
Couchbase and SQL++
The previous two sections (more or less) cover how data scientists operate on their data using dataframes. In the first section, we illustrated how to work with Big Data in Pandas by hand-writing a chunking process ourselves. In the second section, we illustrated how we can use Modin as a drop-in replacement for Pandas to lower the barrier required for data scientists working with Big Data. In this section, we will show how data scientists can work with their Couchbase data in-situ using SQL++ instead of dataframes.
As a reminder, our goal is to find some numeric and tabular representation of all businesses that we can subsequently use as input to an ML model. In the previous two sections, this representation was a dataframe. In this section, our goal is to build a SQL++ query that returns a Python list of numeric-valued iterables. First, let us find the top 10 categories across all businesses. The query below a) iterates over all businesses, b) splits the comma-separated string-valued field categories into an array, c) UNNESTs (a SQL++ operation equivalent to the dataframe operation explode) the array we just found, d) uses a GROUP BY clause on the split categories and the aggregate function COUNT(*) to count the number of instances per category, and e) uses the ORDER BY and LIMIT clauses to return the top 10 group keys (the categories).
results = cluster.analytics_query("""
FROM
yelp.businesses b,
SPLIT(b.categories, ",") c
LET
clean_c = TRIM(c)
GROUP BY
clean_c
SELECT VALUE
clean_c
ORDER BY
COUNT(*) DESC
LIMIT 10
""")
top_categories = list(results)
top_categories
['Restaurants',
'Food',
'Shopping',
'Home Services',
'Beauty & Spas',
'Nightlife',
'Health & Medical',
'Local Services',
...
'Skin Care']
Using the categories we just found, we will assemble a list of SELECT clause projections that return 1 if a category is found and 0 otherwise. Below, our query result list(results) is equivalent (as a list of iterables) to the ml_input_1 dataframe from the previous two sections.
results = cluster.analytics_query(f"""
FROM
yelp.businesses b
SELECT
{','.join(f'TO_BIGINT(CONTAINS(b.categories, "{x}")) AS `{x}`' for x in top_categories)}
""")
list(results)[0]
{
"Active Life": 0,
...
"Shopping": 1,
"Skin Care": 0,
"Specialty Food": 0
}
We will now build upon the SQL++ query above and add a subquery to compute the total number of check-ins. In the subquery below, we a) iterate over the check-ins collection, b) split and UNNEST the comma-separated-string-valued field date, and c) correlate the subquery in the WHERE clause via the conjunct c.business_id = b.business_id. In contrast to standard SQL, SQL++ subqueries always return a multiset of records (rather than a scalar). Consequently, we wrap our subquery in the SQL++ function ARRAY_COUNT to count the number of check-ins a business has. Our query result list(results) at this stage is equivalent (as a list of iterables) to the ml_input_2 dataframe from the previous two sections.
results = cluster.analytics_query(f"""
FROM
yelp.businesses b
SELECT
{','.join(f'TO_BIGINT(CONTAINS(b.categories, "{x}")) AS `{x}`' for x in top_categories)},
ARRAY_COUNT((
FROM
yelp.checkins c,
SPLIT(c.date, ",") d
WHERE
c.business_id = b.business_id
SELECT
1
)) AS `Checkin Count`
""")
list(results)[0]
{
...
"Caterers": 0,
"Checkin Count": 146,
"Chicken Wings": 0,
...
}
The last seven features we are interested in pertain to statistics that were computed with SciPy, a Python library. Couchbase users can leverage Python UDFs for Analytics (currently in developer preview) to execute Python code over their data using SQL++, but here we will write a UDF in SQL++ that replicates the functionality of our summarize function above.
CREATE ANALYTICS FUNCTION yelp.summarize(all_stars) {
{
"Review Count": ARRAY_COUNT(all_stars),
"Minimum Stars": ARRAY_MIN(all_stars),
"Maximum Stars": ARRAY_MAX(all_stars),
"Mean Stars": ARRAY_AVG(all_stars),
"Star Variance": ARRAY_VAR_POP(all_stars),
"Star Skewness": ARRAY_SKEWNESS(all_stars),
"Star Kurtosis": ARRAY_KURTOSIS(all_stars)
}
};
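The statement above is plain SQL++ DDL, so (assuming the Analytics service accepts DDL through the same query call the SDK has been using; the Analytics Workbench works equally well) it can be registered once from Python before running the final feature query:
# Register the SQL++ UDF; the body is the CREATE ANALYTICS FUNCTION statement above.
ddl = """
CREATE ANALYTICS FUNCTION yelp.summarize(all_stars) {
  {
    "Review Count": ARRAY_COUNT(all_stars),
    "Minimum Stars": ARRAY_MIN(all_stars),
    "Maximum Stars": ARRAY_MAX(all_stars),
    "Mean Stars": ARRAY_AVG(all_stars),
    "Star Variance": ARRAY_VAR_POP(all_stars),
    "Star Skewness": ARRAY_SKEWNESS(all_stars),
    "Star Kurtosis": ARRAY_KURTOSIS(all_stars)
  }
};
"""
list(cluster.analytics_query(ddl))  # iterate to force execution; DDL returns no rows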
Finally, we will add to our previous SQL++ query to finish our feature-extracting query. We again start with a correlated subquery that JOINs reviews and businesses to return all star ratings associated with a business. Per business, the results of this subquery are (conceptually) given to the yelp.summarize call. To "promote" the results of the summarize calls to the main SELECT, we leverage the v.* feature of SQL++ SELECT clause projections. Our query result list(results) is equivalent (as a list of iterables) to the ml_input_3 dataframe from the previous two sections, and it is ready to be used directly as training input for some model in a machine learning library (e.g., scikit-learn):
results = cluster.analytics_query(f"""
FROM
yelp.businesses b
LET
review_features = yelp.summarize((
FROM
yelp.reviews r
WHERE
r.business_id = b.business_id
SELECT VALUE
r.stars
))
SELECT
{','.join(f'TO_BIGINT(CONTAINS(b.categories, "{x}")) AS `{x}`' for x in top_categories)},
ARRAY_COUNT((
FROM
yelp.checkins c,
SPLIT(c.date, ",") d
WHERE
c.business_id = b.business_id
SELECT
1
)) AS `Checkin Count`,
review_features.*
""")
list(results)[0]
{
"Review Count": 70,
"Mean Stars": 3.342857142857143,
"Star Variance": 1.596734693877551,
"Star Skewness": -0.49410799997654287,
"Star Kurtosis": -0.624766331689814,
"Minimum Stars": 1,
"Maximum Stars": 5,
"Restaurants": 1,
"Food": 0,
"Shopping": 0,
...
"Checkin Count": 146
}
Compared to our previous two sections, this feature collection process runs in roughly 40 seconds, as opposed to the 1-minute execution time with Modin and the 3-minute execution time with Pandas.
Couchbase and AFrame
The approach above works best for users who prefer the SQL++ abstraction to the dataframe abstraction. All the computation is performed in-situ, and there is no expensive loading step into Python. For users who prefer the dataframe abstraction but still want to work with their data in-situ, AFrame is a viable alternative to hand-writing SQL++. In a nutshell, AFrame is a Python library that provides a dataframe syntax for collections in Couchbase Analytics. Behind the scenes, AFrame translates dataframe operations into SQL++ queries that are evaluated lazily, essentially deferring all dataframe operations to Couchbase Analytics itself. While not a drop-in replacement like Modin, AFrame gives data scientists a Pandas-esque API with the in-situ execution of the previous section. To start, we will clone the AFrame repository and install AFrame using pip:
git clone https://github.com/psinthong/AFrame.git
cd AFrame
pip install .
Once installed, we will import AFrame and define a dataframe for each of our three collections using the CBAnalyticsConnector connector.
from aframe import AFrame
from aframe.connector import CBAnalyticsConnector
connector = CBAnalyticsConnector('http://localhost:8095', 'admin', 'password')
businesses_df = AFrame(dataverse='yelp', dataset='businesses', connector=connector)
checkins_df = AFrame(dataverse='yelp', dataset='checkins', connector=connector)
reviews_df = AFrame(dataverse='yelp', dataset='reviews', connector=connector)
We are now ready to define some features! First, let us find the top 10 categories using our businesses_df dataframe. In the snippet below, we 1) convert the comma-separated categories column into a column of lists, 2) "explode/unnest" the categories column to generate a row per list entry, 3) count the distinct values of the categories column, and 4) extract the top 10 values into a set.
top_categories = businesses_df['categories']
top_categories['categories'] = top_categories \
.map('split', ', ')
top_categories = top_categories \
.explode('categories') \
.value_counts('categories') \
.nlargest(10, 'count') \
['categories']
top_categories = set(top_categories)
top_categories
We are now ready to (again) encode the existence of these top 10 categories as 1/0s. The approach we will use with AFrame involves using a for loop to define a column for each category. The Analytics functions used here are contains and to_bigint. The ml_input_1 dataframe below (or, more accurately, ml_input_1.drop('business_id').toPandas()) is equivalent to the ml_input_1 dataframes from the Pandas and Modin sections:
ml_input_1 = businesses_df[['business_id', 'categories']]
for category in top_categories:
ml_input_1[category] = ml_input_1['categories'] \
.map('contains', category) \
.map('to_bigint')
ml_input_1 = ml_input_1.drop('categories')
ml_input_1.head()
Behind the scenes, AFrame has assembled the following query to execute on Couchbase Analytics. Thanks to the composability of SQL++, AFrame (and other tools built on top of Couchbase) can define deeply nested queries like the snippet below.
SELECT VALUE OBJECT_REMOVE(t, 'categories')
FROM (
SELECT t.*, to_bigint(contains(categories, "Home Services")) AS `Home Services`
FROM (
SELECT t.*, to_bigint(contains(categories, "Bars")) AS `Bars`
FROM ...
.
.
.
) t
) t
In the snippet below, we 1) count the number of check-ins per business and 2) merge/join our results into the ml_input_1 dataframe from before to produce a new ml_input_2 dataframe. Again, the ml_input_2 dataframe below is equivalent to the ml_input_2 dataframes found in the first two sections.
result = checkins_df
result['date'] = result \
['date'] \
.map('split', ', ') \
.map('array_count')
ml_input_2 = ml_input_1 \
.merge(result, left_on='business_id', right_on='business_id', how='left')
ml_input_2.head()
To conclude, we will define our remaining seven features involving the reviews_df dataframe. In the snippet below, we use AFrame grouping and several Analytics aggregate functions that cover the statistics provided by yelp.summarize above.
result = reviews_df[['business_id', 'stars']] \
.groupby('business_id') \
.agg({'stars': ['count', 'min', 'max', 'mean', 'var', 'skewness', 'kurtosis']}) \
.rename({'count_stars': 'Review Count',
'min_stars': 'Minimum Stars',
'max_stars': 'Maximum Stars',
'avg_stars': 'Mean Stars',
'var_stars': 'Star Variance',
'skewness_stars': 'Star Skewness',
'kurtosis_stars': 'Star Kurtosis'})
ml_input_3 = ml_input_2 \
.merge(result, left_on='business_id', right_on='business_id', how='left') \
.drop('business_id')
ml_input_3.head()
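Since the features ultimately need to land in an ML library, one option (mirroring the toPandas() call mentioned above for ml_input_1, and assuming it materializes the query result client-side) is to pull down only the final, aggregated feature rows:
# Materialize the final feature set locally; at this point it is one row per
# business rather than the raw ~7M reviews, so it fits comfortably in memory.
ml_input_3_pd = ml_input_3.toPandas()
ml_input_3_pd.head()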
AFrame was a research project that (unfortunately) lacked the commercial support given to projects like Modin. Consequently, the feature extraction here takes a little under 2 minutes. While the generated SQL++ query is equivalent (or near-equivalent) to the handcrafted SQL++ query in the previous section, most databases (including Couchbase Analytics) have historically had trouble optimizing nested queries (see work from Elliott, Cheng, Thomas-Ogbuji, and Ozoyoglu here for research in the context of SPARQL to SQL)...
...that's not to say that AFrame should be abandoned though — the exercise in this section has shown that AFrame is a very capable "dataframe to SQL++ query" generator. The generated SQL++ queries can then be used to guide the authoring of cleaner SQL++ queries that are executed using the Couchbase Python SDK. While not as clean as Modin, this workflow (AFrame + Couchbase SDK) gives data scientists the resource efficiency of the previous section and the dataframe user model of the first two sections.
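As a rough sketch of that workflow, the query text below is a hypothetical hand-cleaned fragment (standing in for whatever you distill from AFrame's generated SQL++), executed with the same result_to_df helper from the first section:
# Hypothetical hand-cleaned SQL++ derived from AFrame's generated query,
# executed via the Couchbase Python SDK and landed directly in Pandas.
cleaned_query = """
FROM yelp.businesses b
SELECT TO_BIGINT(CONTAINS(b.categories, "Restaurants")) AS `Restaurants`,
       TO_BIGINT(CONTAINS(b.categories, "Food")) AS `Food`
"""
features_df = result_to_df(cleaned_query)  # result_to_df is defined in the Pandas section
features_df.head()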
Conclusion
It seems we are still in need of that "silver bullet": an API that gives us efficient out-of-core execution with a dataframe user model. In this article, we looked at four different approaches for generating features using data stored in Couchbase Analytics: 1) Pandas, 2) Modin, 3) SQL++, and 4) AFrame.
- Pandas, the de facto dataframe standard for Python, has massive adoption by the data science community but struggles with Big Data;
- Modin, a (near) drop-in replacement for Pandas that allows users to scale their Pandas workflows at the cost of a (potentially) expensive loading step;
- SQL++, a non-dataframe user model that gives users the ability to express in-situ efficient execution of Couchbase data; and
- AFrame, a dataframe wrapper that translates dataframe operations into SQL++ queries.