
Boosted Embeddings with Catboost

In this article, gain insight into a simple and lesser-known way to use embeddings with gradient-boosted models in CatBoost.

By Taras Baranyuk · Updated Aug. 26, 2021

Introduction

When working with a large amount of data, it becomes necessary to compress the feature space into vectors. A typical example is text embeddings, which are an integral part of almost any NLP model. Unfortunately, it is not always possible to use neural networks with this kind of data, because, for example, the fitting or inference rate may be too low.

I want to suggest an interesting way to use gradient boosting that few people know about.

Data

One of Kaggle's competitions recently ended, and it provided a small dataset with text data. I decided to take this data for my experiments, since the competition showed that the dataset is labelled well, and I did not run into any unpleasant surprises.

Kaggle competition dataset

Columns:

  • id - unique ID for excerpt
  • url_legal - URL of source
  • license - license of source material
  • excerpt - text to predict reading ease
  • target - reading ease
  • standard_error - measure of the spread of scores among multiple raters for each excerpt
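
Throughout the code below, the data is assumed to already be loaded into a pandas DataFrame called train. A minimal loading sketch (the file name train.csv is an assumption based on the usual Kaggle layout, not something stated in the article):

import pandas as pd

# Assumption: the competition's training file is available locally as train.csv
train = pd.read_csv('train.csv')
print(train.shape)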

The target in the dataset is a numerical variable, so the task is posed as a regression problem. However, I decided to turn it into a classification problem. The main reason is that the library I will use does not support working with text and embeddings in regression problems; I hope the developers eliminate this limitation in the future. In any case, regression and classification are closely related, and for this analysis it makes no difference which of the two I solve.

Count/value graph

Let's calculate the number of bins by Sturges' rule:

import numpy as np

num_bins = int(np.floor(1 + np.log2(len(train))))
train['target_q'], bin_edges = pd.qcut(train['target'],
    q=num_bins, labels=False, retbins=True, precision=0)
Labels graph

But first, I clean up the data.

train['license'] = train['license'].fillna('nan')
train['license'] = train['license'].astype('category').cat.codes

With the help of a small self-written function, I clean and lemmatize the text. The function could be made more sophisticated, but this is enough for my experiment.

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def clean_text(text):
    # Strip punctuation, lowercase, tokenize, drop English stopwords, lemmatize.
    table = text.maketrans(
        dict.fromkeys(string.punctuation))
    words = word_tokenize(
        text.lower().strip().translate(table))
    words = [word for word in words if word not in stopwords.words('english')]
    lemmed = [WordNetLemmatizer().lemmatize(word) for word in words]
    return " ".join(lemmed)

I save the cleaned text as a new feature.

train['clean_excerpt'] = train['excerpt'].apply(clean_text)

In addition to the excerpt text, I can extract individual words from the URLs and turn them into a new text feature.

import re

def getWordsFromURL(url):
    return re.compile(r'[\:/?=\-&.]+', re.UNICODE).split(url)

train['url_legal'] = train['url_legal'].fillna("nan").apply(getWordsFromURL).apply(
    lambda x: " ".join(x))

I also created several new features from the text: various pieces of statistical information. Again, there is a lot of room for creativity here, but these are enough for our purposes. The main goal of these features is to be useful for the baseline model.

from statistics import mean
from nltk.tokenize import sent_tokenize

def get_sentence_lengths(text):
    # Max, min, and mean sentence length (in words) for a given text.
    tokened = sent_tokenize(text)
    lengths = []
    for i in tokened:
        splited = list(i.split(" "))
        lengths.append(len(splited))
    return (max(lengths),
            min(lengths),
            round(mean(lengths), 3))

def create_features(df):
    df_f = pd.DataFrame(index=df.index)
    df_f['text_len'] = df['excerpt'].apply(len)
    df_f['text_clean_len'] = df['clean_excerpt'].apply(len)
    df_f['text_len_div'] = df_f['text_clean_len'] / df_f['text_len']
    df_f['text_word_count'] = df['clean_excerpt'].apply(
        lambda x: len(x.split(' ')))
    # Sentence-length statistics are computed from the raw excerpt.
    df_f[['max_len_sent', 'min_len_sent', 'avg_len_sent']] = \
        df.apply(
            lambda x: get_sentence_lengths(x['excerpt']),
            axis=1, result_type='expand')
    return df_f

train = pd.concat(
    [train, create_features(train)], axis=1, copy=False, sort=False)

basic_f_columns = [
    'text_len', 'text_clean_len', 'text_len_div', 'text_word_count',
    'max_len_sent', 'min_len_sent', 'avg_len_sent']

When data is scarce, it is difficult to test hypotheses, and the results are usually unstable. Therefore, to be more confident in the results, I prefer to use OOF (out-of-fold) predictions in such cases.
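
To make the idea concrete, here is a minimal OOF sketch; make_model is a hypothetical factory for any classifier, and the full CatBoost-specific implementation used in this article follows below. Each row receives a prediction from a model that never saw it during training.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def oof_sketch(make_model, X, y, n_classes, n_folds=5, seed=0):
    # Every row of X gets a probability vector from a model trained on the other folds.
    oof = np.zeros((len(X), n_classes))
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for tr_i, te_i in kf.split(X, y):
        model = make_model()
        model.fit(X.iloc[tr_i], y[tr_i])
        oof[te_i] = model.predict_proba(X.iloc[te_i])
    return oof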

Baseline

I chose CatBoost as the free library for the model. CatBoost is a high-performance, open-source library for gradient boosting on decision trees. From release 0.19.1, it supports text features for classification on GPU out of the box. The main advantage is that CatBoost can include categorical and text features in your data without additional preprocessing.

In Unconventional Sentiment Analysis: BERT vs. Catboost, I expanded on how CatBoost works with text and compared it with BERT.

This library has a killer feature: it knows how to work with embeddings. Unfortunately, at the moment there is not a word about this in the documentation, and very few people know about this CatBoost advantage.

!pip install catboost

When working with CatBoost, I recommend using Pool. It is a convenience wrapper that combines features, labels, and further metadata such as categorical and text features.
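
As a minimal sketch, a Pool for this dataset could be assembled like this; df and y are assumed to hold the prepared features and the binned target, and embedding_features only becomes relevant once embeddings are added later:

from catboost import Pool

pool = Pool(
    data=df,                       # assumed: DataFrame with the prepared feature columns
    label=y,                       # assumed: the binned target
    cat_features=['license'],      # categorical columns
    text_features=['url_legal'],   # text columns handled by CatBoost's text processing
    embedding_features=None        # filled in later, when embedding columns are present
)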

To compare experiments, I created a baseline model that uses only numerical and categorical features.

I wrote a function to initialize and train the model. By the way, I did not tune the parameters.

from catboost import CatBoostClassifier

def fit_model_classifier(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=5000,
        eval_metric='AUC',
        od_type='Iter',
        od_wait=500,
        l2_leaf_reg=10,
        bootstrap_type='Bernoulli',
        subsample=0.7,
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
        plot=False,
        use_best_model=True)

For OOF implementation, I wrote a small and simple function.

from catboost import Pool
from sklearn.model_selection import StratifiedKFold

def get_oof_classifier(
        n_folds, x_train, y, embedding_features,
        cat_features, text_features, tpo, seeds,
        num_bins, emb=None, tolist=True):

    ntrain = x_train.shape[0]

    # OOF class probabilities: one slice per seed, averaged at the end
    oof_train = np.zeros((len(seeds), ntrain, num_bins))
    models = {}

    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)    
      
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            if emb and len(emb) > 0:
                x_tr = pd.concat(
                    [x_train.iloc[tr_i, :],
                     get_embeddings(
                         x_train.iloc[tr_i, :], emb, tolist)],
                    axis=1, copy=False, sort=False)
                x_te = pd.concat(
                    [x_train.iloc[t_i, :],
                     get_embeddings(
                         x_train.iloc[t_i, :], emb, tolist)],
                    axis=1, copy=False, sort=False)
                columns = [
                    x for x in x_tr if (x not in ['excerpt'])]  
                if not embedding_features:
                    for c in emb:
                        columns.remove(c)
            else:
                x_tr = x_train.iloc[tr_i, :]
                x_te = x_train.iloc[t_i, :]
                columns = [
                    x for x in x_tr if (x not in ['excerpt'])] 
            x_tr = x_tr[columns]
            x_te = x_te[columns]                
            y_tr = y[tr_i]            
            y_te = y[t_i]

            train_pool = Pool(
                data=x_tr,
                label=y_tr,
                cat_features=cat_features,
                embedding_features=embedding_features,
                text_features=text_features)

            valid_pool = Pool(
                data=x_te,
                label=y_te,
                cat_features=cat_features,
                embedding_features=embedding_features,
                text_features=text_features)

            model = fit_model_classifier(
                train_pool, valid_pool,
                random_seed=seed,
                text_processing=tpo
            )
            oof_train[iseed, t_i, :] = \
                model.predict_proba(valid_pool)
            models[(seed, i)] = model
            
    oof_train = oof_train.mean(axis=0)
    
    return oof_train, models

I will describe the get_embeddings function below; for now, it is not needed to get the model's baseline.

I trained the baseline model with the following parameters:

columns = ['license', 'url_legal'] + basic_f_columns 
oof_train_cb, models_cb = get_oof_classifier(
    n_folds=5,
    x_train=train[columns],
    y=train['target_q'].values,
    embedding_features=None,
    cat_features=['license'],
    text_features=['url_legal'],
    tpo=tpo,
    seeds=[0, 42, 888],
    num_bins=num_bins
)

Quality of the trained model:

from sklearn.metrics import roc_auc_score
roc_auc_score(train['target_q'], oof_train_cb, multi_class="ovo")
AUC: 0.684407

Now I have a benchmark for the model's quality. Judging by the numbers, the model turned out weak, and I would not put it into production.

Embeddings

An embedding translates high-dimensional vectors into a relatively low-dimensional space. Embeddings therefore simplify machine learning on large inputs such as sparse vectors representing words. Ideally, an embedding captures some of the input's semantics by placing semantically similar inputs close to each other in the embedding space.
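
As a tiny illustration of "semantically similar inputs end up close to each other," one can compare cosine similarities of sentence embeddings. This is only a sketch; the model name matches the one used later in this article, and the example sentences are made up:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-mpnet-base-v2')
emb = model.encode([
    'The cat sits on the mat.',
    'A kitten is lying on the rug.',
    'Quarterly revenue grew by ten percent.'
])

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two cat sentences should be noticeably closer to each other
# than either is to the finance sentence.
print(cos(emb[0], emb[1]), cos(emb[0], emb[2]))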

There are many ways to obtain such vectors, and I do not consider them in this article, since that is not the purpose of this study. It is enough for me to obtain embeddings in any way; the main thing is that they preserve the necessary information. In most cases, I use the currently popular method: pre-trained transformers.

import torch
from sentence_transformers import SentenceTransformer

STRANSFORMERS = {
    'sentence-transformers/paraphrase-mpnet-base-v2': ('mpnet', 768),
    'sentence-transformers/bert-base-wikipedia-sections-mean-tokens': ('wikipedia', 768)
}
def get_encode(df, encoder, name):    
    device = torch.device(
        "cuda:0" if torch.cuda.is_available() else "cpu")

    model = SentenceTransformer(
        encoder, 
        cache_folder=f'./hf_{name}/'
    )
    model.to(device)
    model.eval()
    return np.array(model.encode(df['excerpt']))
def get_embeddings(df, emb=None, tolist=True):
    
    ret = pd.DataFrame(index=df.index)
    
    for e, s in STRANSFORMERS.items():
        if emb and s[0] not in emb:
            continue
        
        ret[s[0]] = list(get_encode(df, e, s[0]))
        if tolist:
            ret = pd.concat(
                [ret, pd.DataFrame(
                    ret[s[0]].tolist(),
                    columns=[f'{s[0]}_{x}' for x in range(s[1])],
                    index=ret.index)],
                axis=1, copy=False, sort=False)
    
    return ret

Now I have everything to start testing different versions of the models.

Models

I have several options for fitting models:

  • text features;
  • embedding features;
  • embedding features as a list of separate numerical features.

I trained various combinations of these options one after another, which allowed me to conclude how useful embeddings can be, or whether they are just over-engineering.

As an example, here is the code that uses all three options:

columns = ['license', 'url_legal', 'clean_excerpt', 'excerpt'] 
oof_train_cb, models_cb = get_oof_classifier(
    n_folds=FOLDS,
    x_train=train[columns],
    y=train['target_q'].values,
    embedding_features=['mpnet', 'wikipedia'],
    cat_features=['license'],
    text_features=['clean_excerpt','url_legal'],
    tpo=tpo,
    seeds=[0, 42, 888],
    num_bins=num_bins,
    emb=['mpnet', 'wikipedia'],
    tolist=True
)

For completeness, I trained models on both GPU and CPU and summarized the results in one table.

GPU/CPU results

The first thing that shocked me was the extremely poor interaction between the text feature and embeddings. Unfortunately, I do not yet have a logical explanation for this; a more detailed study of the issue on other datasets is required. In the meantime, note that using both the text and its embedding for the same field can bring down the quality of the model.

Another revelation for me was that embeddings do not work when training models on a CPU.

And now the good news: if you have a GPU and can obtain embeddings, the best quality comes from simultaneously using the embedding both as an embedding feature and as a list of separate numerical features.

Summary

In this article, I:

  • selected a small free dataset for tests;
  • created several statistical features from the text data to use for a baseline model;
  • tested various combinations of embeddings, texts, and simple features;
  • got some non-obvious insights.

I hope this little-known information will be useful to the community and will benefit your projects. Unfortunately, CatBoost's functionality for working with embeddings and texts is still raw. But it is actively being improved, and I hope there will be a stable release soon and that the developers will update the documentation. The complete code to reproduce the results from this article is available here.
