# Credit Scoring: Data Preparation and Exploratory Data Analysis

# Credit Scoring: Data Preparation and Exploratory Data Analysis

### As much as we all may despise credit scores, the way that they are calculated is actually pretty fascinating. Come learn about the process!

Join the DZone community and get the full member experience.

Join For FreeHortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

"Garbage in, garbage out" is a commonly used axiom in computer science and a threat to a project's success — the quality of output is largely determined by the quality of input. Therefore, data preparation is a key aspect of any data mining project including development of a credit scorecard. This is the most challenging and time-consuming phase of the CRISP-DM cycle. At least 70%, sometimes more than 90%, of total project time is dedicated to this activity. It involves data collection, combining multiple data sources, aggregations, transformations, data cleansing, "slicing and dicing," and looking at the breadth and depth of data to get a clear understanding and to transform quantity of data into quality of data so that we can prepare confidently for the next phase: model building.

The previous article in this series discussed the importance of a model design and identified its main components including unit of analysis, population frame, sample size, criterion variable, modeling windows, data sources, and data collection methods. Careful consideration of each of the components is imperative for successful data preparation. The final product of this stage is a mining view encompassing the right level of analysis, modeling population, and independent and dependent variables.

*Table 1: Model design components*

## Data Sources

"The more, the merrier," they say. As part of data understanding step, any external and internal data sources should provide both quantity and quality. Data utilized must be relevant, accurate, timely, consistent, and complete while being of sufficient and diverse volume to provide a useful result in analysis. For application scorecards where there is a limited amount of internal data, external data has prevalence. In contrast, behavior scorecards utilize more of the internal data and are typically superior in terms of predictive power. The common data sources that would be required for customer verification, fraud detection, or credit grant are outlined below.

*Table 2: Data sources diversity*

## The Process

The data preparation process starts with data collection, commonly known as the ETL process (extract-transform-load). Data integration combines different sources using data merging and concatenation. Typically, it requires manipulation of relational tables using a number of integrity rules such as entity, referential, and domain integrity. Using one-to-one, one-to-many, or many-to-many relationships, the data is aggregated to the desired level of analysis so that a unique customer signature is produced.

*Figure 1: Data preparation process*

Data exploration and data cleansing are mutually iterative steps. Data exploration includes both univariate and bivariate analysis and ranges from univariate statistics and frequency distributions to correlations, cross-tabulation, and characteristic analysis.

* Figure 2: EDA (Univariate View)*

* Figure 3: EDA (Characteristic Analysis)*

Following exploratory data analysis (EDA), the data is treated in order to increase the quality. Data cleansing requires good business and data understanding so the data can be interpreted in the right way. This is an iterative process designed to remove irregularities and replacing, modifying, or deleting these irregularities as appropriate. Two major issues with unclean data are missing values and outliers; both can heavily affect the model accuracy and careful intervention is therefore imperative.

Before a decision is made on how to treat missing values, we need to understand the reason for missing data and understand the distribution of missing data. We can categorize it as one of the following:

- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)

Missing data treatment often assumes MCAR and MAR, while NMAR is more difficult to deal with. The list below provides the common treatments ordered by complexity.

*Table 3: Missing data treatments*

Outliers are another "beast" in our data, as their presence can violate statistical assumptions under which we develop a model. Once identified, it is important to understand the reasons for having outliers before applying any treatment. For example, outliers could be a valuable source of information in fraud detection; hence, it would be a bad idea to replace them with a mean or median value.

Outliers should be analyzed using univariate and multivariate analysis. For detection, we can use visual methods such as histograms, box-plots, or scatter plots and statistical methods such as mean and standard deviation, clustering by examining distant clusters, small decision tree leaf nodes, Mahalanobis distance, Cook's D, or Grubbs' test. The judgment of what should be considered an outlier is not as straightforward as identifying missing values. The decision should be based upon a specified criterion, for example, any value outside ±3 standard deviations, or ±1.5IQR, or 5th-95th percentile range would be labeled as an outlier.

Outliers can be treated in a similar way as missing values. Other transformations can also be utilized, including binning, weights assignment, conversion to missing values, logarithm transformations to eliminate the influence of extreme values or Winsorization.

As discussed above, data cleansing may involve the implementation of different statistical and machine learning techniques. Even though these transformations could create more superior scorecard models, the practically of implementation must be considered as complex data manipulations can be difficult to implement, can be costly, and can slow down model processing performance.

Once the data is clean, we are ready for a more creative part: data transformations. Data transformation, or feature engineering, is the creation of additional (hypothesized) model variables that are tested for significance. The most common transformations include binning and optimal binning, standardization, scaling, one hot encoding, interaction terms, mathematical transformations (from non-linear into linear relationships and from skewed data into normally distributed data), and data reduction using clustering and factor analysis.

Apart from some general recommendations on how to tackle this task, it is the responsibility of the data scientist to suggest the best approach to transforming the customer data signature into a powerful information artifact: the mining view. This is probably the most creative, and the most challenging, aspect of the data scientist role, as it requires a solid grasp of business understanding in addition to statistical and analytical skills. Very often, the key to creating a good model is not the power of a specific modeling technique but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.

The rest is the art of feature creation...

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub. Join the discussion.

Published at DZone with permission of Natasha Mashanovich , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}