Variable Selection and Big Data Analytics in Credit Score Modeling

The variable selection step in the credit score modeling process is critical to finding the key information. Learn how to do it and gain a good understanding of your data!

By Natasha Mashanovich · Oct. 12, 17 · Opinion

"Doing more with less" is the main philosophy of credit intelligence, and credit risk models are the means to achieve this goal. Using an automated process and focussing on the key information, credit decisions can be made in seconds and can eventually reduce operational cost by making the decision process much faster. Fewer questions and rapid credit decisions ultimately increase customer satisfaction. For lenders this means expanding their customer base, taking on board less risky customers and increasing the profit.

How can we achieve parsimony, and what is the key information to look for? The answer is found in the next step of the credit risk modeling process: variable selection.

The mining view created as the result of data preparation is a multi-dimensional, unique customer signature used to discover potentially predictive relationships and test the strength of those relationships. A thorough analysis of the customer signature is an important step in creating a set of testable hypotheses based on the characteristics it contains. Often referred to as business insights analysis, this analysis provides an interpretation of trends in customer behavior, which aims to direct the modeling process.

The purpose of the business insights analysis is to:

  1. Validate that the derived customer data is in line with business understanding. For example, the analysis should support the business statement that customers with a higher debt-to-income ratio are more likely to default.
  2. Provide benchmarks for analyzing model results.
  3. Shape the modeling methodology.

Business insights analysis utilizes techniques similar to exploratory data analysis, combining univariate and multivariate statistics with different data visualization techniques. Typical techniques are correlation, cross-tabulation, distribution analysis, time-series analysis, and supervised and unsupervised segmentation analysis. Segmentation is of special importance, as it determines whether multiple scorecards are needed.
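As a concrete illustration, the following is a minimal sketch of such a check in Python, assuming the mining view is a pandas DataFrame with a binary target column `default_flag` and a numeric `debt_to_income` column (the column names and file path are illustrative, not from the original article):

```python
import pandas as pd

# Hypothetical mining view: one row per customer, binary target `default_flag`.
mining_view = pd.read_csv("mining_view.csv")

# Cross-tabulation: the default rate by debt-to-income band should increase
# with the band if the data supports the business statement above.
mining_view["dti_band"] = pd.qcut(mining_view["debt_to_income"], q=5, duplicates="drop")
print(mining_view.groupby("dti_band", observed=True)["default_flag"].mean())

# Correlation of each numeric candidate variable with the target gives a
# first, rough ranking of predictive strength.
numeric_cols = mining_view.select_dtypes("number").columns
print(mining_view[numeric_cols].corrwith(mining_view["default_flag"]).sort_values())
```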

Variable selection, based on the results of the business insights analysis, starts by partitioning the mining view into at least two partitions: a training partition and a testing partition. The training partition is used to develop the model; the testing partition is used to assess the model's performance and validate it.
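A minimal partitioning sketch with scikit-learn, continuing the hypothetical mining view from above (the 70/30 split and the stratification on the target are illustrative choices, not prescribed by the article):

```python
from sklearn.model_selection import train_test_split

X = mining_view.drop(columns=["default_flag"])
y = mining_view["default_flag"]

# Stratify on the target so both partitions keep roughly the same bad rate;
# the training partition is used for development, the testing partition for
# performance assessment and validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```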

Figure 1: Simplified scorecard model building process

Variable Selection

Variable selection starts from a collection of candidate model variables that are tested for significance during model training. Candidate model variables are also known as independent variables, predictors, attributes, model factors, covariates, regressors, features, or characteristics.

Variable selection is a parsimonious process that aims to identify a minimal set of predictors for maximum gain (predictive accuracy). This is the opposite of data preparation, where as many meaningful variables as possible are added to the mining view. These opposing requirements are reconciled through optimization, that is, by finding the minimal selection bias under the given constraints.

The key objective is to find the right set of variables so that the scorecard model can not only rank customers by their likelihood of bad debt but also estimate the probability of bad debt. This usually means selecting statistically significant variables for the predictive model and having a balanced set of predictors (usually 8-15 is considered a good balance) that converges to a 360-degree customer view. In addition to customer-specific risk characteristics, we should also consider including systematic risk factors to account for economic drifts and volatilities.
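To make the objective concrete, here is a minimal sketch, continuing the partition above, of a scorecard-style logistic regression that both ranks customers and estimates their probability of bad debt; the predictor names are hypothetical and the model is deliberately bare-bones:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# A small, balanced set of predictors (hypothetical names).
predictors = ["debt_to_income", "credit_utilization", "months_on_book"]

model = LogisticRegression(max_iter=1000)
model.fit(X_train[predictors], y_train)

# predict_proba estimates the probability of bad debt; the ROC AUC on the
# testing partition measures how well those probabilities rank customers.
probs = model.predict_proba(X_test[predictors])[:, 1]
print("Test AUC:", roc_auc_score(y_test, probs))
```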

This is easier said than done: when selecting variables, there are a number of limitations. First, the candidate pool will usually contain some highly predictive variables whose use is prohibited by legal, ethical, or regulatory rules. Second, some variables might not be available, or might be of poor quality, during the modeling or production stages. In addition, there might be important variables that have not been recognized as such, for example because of a biased population sample, or because their model effect would be counter-intuitive as a result of multicollinearity. And finally, the business will always have the last word and might insist that only business-sound variables are included or request monotonically increasing or decreasing effects.

All of these constraints are potential sources of bias, which gives data scientists the challenging task of minimizing selection bias. Typical preventive measures during variable selection include:

  • Collaboration with experts in the field to identify the important variables.
  • Awareness of any problems relating to data sources, reliability, or mismeasurement.
  • Cleaning the data.
  • Using control variables to account for banned variables or for specific events such as economic drift (sketched below).
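As a sketch of the control-variable idea in the last bullet, one option is an indicator for applications taken during an economic downturn, assuming the mining view carries an application date (the column name and the downturn window are hypothetical):

```python
import pandas as pd

mining_view["application_date"] = pd.to_datetime(mining_view["application_date"])

# Indicator for applications taken during a hypothetical downturn window;
# it enters the model as a control for economic drift rather than as a
# customer-level predictor.
start, end = pd.Timestamp("2008-09-01"), pd.Timestamp("2009-06-30")
mining_view["downturn_flag"] = mining_view["application_date"].between(start, end).astype(int)
```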

It is important to recognize that variable selection is an iterative process that occurs throughout the model building process.

  • It starts prior to model fitting by reducing the number of variables in the mining view to a manageable set of candidate variables.
  • Continues during the model training process, where further reduction is implemented as a result of statistical insignificance, multicollinearity, low contributions, or penalization to avoid overfitting (a minimal check is sketched below).
  • Carries on during model evaluation and validation.
  • Finalizes during business approval, where model readability and interpretability play an important part.

Variable selection finishes after the "sweet spot" has been reached — meaning that no more improvement can be achieved in terms of model accuracy.

Figure 2: Iterative nature of variable selection process
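The multicollinearity and penalization steps mentioned in the list above can be sketched as follows, continuing the earlier training partition; the VIF cut-off of 5 and the L1 regularization strength are common rules of thumb, not values from the article:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 1. Multicollinearity: drop candidates with a high variance inflation factor.
X_num = sm.add_constant(X_train[predictors].astype(float))
vif = {col: variance_inflation_factor(X_num.values, i)
       for i, col in enumerate(X_num.columns) if col != "const"}
keep = [col for col, v in vif.items() if v < 5]

# 2. Penalization: an L1-penalized logistic regression shrinks weak
#    contributors to exactly zero, guarding against overfitting.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_model.fit(StandardScaler().fit_transform(X_train[keep]), y_train)
selected = [col for col, coef in zip(keep, l1_model.coef_[0]) if coef != 0]
print("Surviving variables:", selected)
```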

A plethora of variable selection methods is available, and with advances in machine learning this number has been constantly increasing. Variable selection techniques differ in whether they use variable reduction or variable elimination (filtering); whether the selection process is carried out inside or outside the predictive model; whether they use supervised or unsupervised learning; and whether the underlying methods rely on embedded techniques such as cross-validation.


Table 1: Variable selection methods typical in credit risk modeling

Figure 3: Variable selection using bivariate analysis

In credit risk modeling, two of the most commonly used variable selection methods are information value, for filtering prior to model training, and stepwise selection, for selecting variables during the training of a logistic regression model. Although both receive some criticism from practitioners, it is important to recognize that no ideal methodology exists: each variable selection method has its pros and cons. Which ones to use, and how best to combine them, is not an easy question to answer and requires solid domain knowledge, a good understanding of the data, and extensive modeling experience.
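A minimal sketch of both methods, continuing the earlier training partition; the binning scheme, the 0.02 IV threshold, and the 0.05 p-value cut-off are common conventions rather than values prescribed by the article:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def information_value(series, target, bins=10):
    """Weight of evidence / information value for one candidate variable."""
    df = pd.DataFrame({"x": pd.qcut(series, q=bins, duplicates="drop"), "y": target})
    grouped = df.groupby("x", observed=True)["y"]
    bads = grouped.sum()
    goods = grouped.count() - bads
    # Small constant avoids division by zero in sparse bins.
    dist_good = (goods + 0.5) / goods.sum()
    dist_bad = (bads + 0.5) / bads.sum()
    woe = np.log(dist_good / dist_bad)
    return ((dist_good - dist_bad) * woe).sum()

# Filter step: keep variables whose IV clears a modest threshold
# (0.02 is a common rule of thumb for "weak but usable").
iv = {col: information_value(X_train[col], y_train)
      for col in X_train.select_dtypes("number").columns}
candidates = [col for col, v in iv.items() if v >= 0.02]

# Wrapper step: crude forward stepwise selection, adding the variable with
# the lowest p-value until no remaining variable stays below 0.05.
selected, remaining = [], list(candidates)
while remaining:
    pvals = {}
    for col in remaining:
        fit = sm.Logit(y_train, sm.add_constant(X_train[selected + [col]])).fit(disp=0)
        pvals[col] = fit.pvalues[col]
    best = min(pvals, key=pvals.get)
    if pvals[best] > 0.05:
        break
    selected.append(best)
    remaining.remove(best)

print("Selected variables:", selected)
```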

Tags: Big Data, Data Science, Analytics

Published at DZone with permission of Natasha Mashanovich, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
