Curious Case of Actuarial Science, Geocoding, and Machine Learning
Let's take a look at actuarial science, geocoding, and Machine Learning as well as how they relate to one another.
This article illustrates how Geocoding uncovers untapped value within generally overlooked insurance categories, such as Life and Annuity, and how it can help address the modern-day business challenges noted by Orszag. While Geocoding in Big Data is gaining prominence within Property and Casualty (P&C), we believe the real opportunity lies in the actuarial adoption of an AI framework capable of processing consumable inputs that weren’t visible in the erstwhile "Ease of Geocoding" era.
Having established this premise for Life and Annuity, we then pivot towards crafting a general-purpose, Geo-inclusive architecture that can help actuaries of all disciplines apply Machine Learning to a new generation of business problems, such as dwindling subscribers, and to risk-attributed challenges, such as Adverse Selection.
Nearly all of the data in the insurance business has a location attribute, e.g., the address of a policyholder or the location of an incident. However, many insurance companies have not used this component for anything beyond billing and mailing. Randall E. Brubaker, FCAS, observes that several providers still rely on zip code-based risk calculations, even though zip codes are incongruent and subject to change. A location, on the other hand, helps determine the precise address of risk.
With the advent of Geocoding, location-based risk computation is simplified, and the efficiency of Big Data makes it possible to run millions of such computations in a matter of seconds. As a quick reminder, Geocoding is the process of assigning a unique identifier (UID) to an address, location, or geographic shape. This capability helps an analyst view an underwritten business through the lens of demographic or geographic attributes, which can be combined to investigate possible relationships. Let's see how this applies to Life and Annuity.
Life and Annuity
Life and Annuity firms set aside reserves to hedge against market indexes based on assumptions generated by actuarial models. These models create buckets of the population based on a few factors, usually five or fewer. These factors are used to segment a book of business into evenly distributed groups so that aggregate ratios on attrition, withdrawals, and other measures can be computed. This supports several analytical workloads, such as year-over-year trend analysis.
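The bucketing described above can be sketched in a few lines. This is a minimal illustration, not an actuarial model: the three segmenting factors (age band, gender, product) and the six policy records are hypothetical, and the "aggregate ratio" shown is a simple lapse rate per bucket.

```python
from collections import defaultdict

# Hypothetical policy records: (age_band, gender, product, lapsed_this_year).
# The factors and values are assumptions made for illustration only.
policies = [
    ("40-49", "F", "fixed", False),
    ("40-49", "F", "fixed", True),
    ("40-49", "M", "variable", False),
    ("50-59", "M", "variable", True),
    ("50-59", "M", "variable", False),
    ("50-59", "M", "variable", False),
]

def lapse_rates(records):
    """Return {bucket: lapse rate}, where a bucket is the tuple of factors."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [lapses, policies]
    for *factors, lapsed in records:
        bucket = tuple(factors)
        counts[bucket][0] += int(lapsed)
        counts[bucket][1] += 1
    return {bucket: lapses / n for bucket, (lapses, n) in counts.items()}

rates = lapse_rates(policies)
for bucket, rate in sorted(rates.items()):
    print(bucket, f"{rate:.2f}")
```

With only a handful of factors, every policyholder in a bucket receives the same assumption, regardless of where they live or work.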
According to Sankar Virdhagriswaran, IBM Fellow of Cognitive Decision Support Services, five or fewer variables make it nearly impossible to identify emerging trends attributed to a specific population or location. When only a limited set of variables is available to counter Adverse Selection, policies are priced on broad characteristics, shifting the burden of coverage to Underwriting. The resulting wide classification of risk categories leads to higher pricing on annuity contracts and, subsequently, a larger pool of funds allocated for market hedging, which in turn leads to limited investment options.
This rigidity keeps insurers from reacting quickly to changing market conditions or creating timely products that could be marketed at locations better suited to specific subscriber populations, like Gen X, Millennials, and Gen Y. This could be one of the contributing factors to Orszag’s puzzle of the drop in Life subscriptions from 77 percent in 1989 to just 60 percent today.
To address Adverse Selection, what if a rich vocabulary of statistically significant "Geocodable" variables were available, not just at the zip code level but at a granular Block Group level, to help assess risk entities quarterly rather than yearly? For example, how does a daily commute from a low traffic density area to a gridlocked location impact my mortality? Does my life expectancy change if I relocate from clean outskirts to an industrial belt within the same zip code? Should my Life premium be adjusted if I move from a hurricane-prone region in the east to an earthquake zone in the west? What if actuarial assumptions in Life accounted for a similar set of external attributes to determine the risk of attrition, withdrawal, and surrender?
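To make the difference in granularity concrete, here is a small sketch that groups the same policies by zip code and by Block Group. It assumes US Census Block Group identifiers, which are 12 digits long (2 for state, 3 for county, 6 for tract, 1 for block group). The policy IDs, zip code, and GEOID values are made up for illustration, and the geocoding step that would produce them is not shown.

```python
from collections import defaultdict

def split_block_group_geoid(geoid: str) -> dict:
    """Split a 12-digit US Census Block Group GEOID into its components."""
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"expected 12 digits, got {geoid!r}")
    return {
        "state": geoid[:2],
        "county": geoid[2:5],
        "tract": geoid[5:11],
        "block_group": geoid[11],
    }

# Hypothetical geocoded policies: (policy_id, zip_code, block_group_geoid).
geocoded = [
    ("P1", "77001", "482011000001"),
    ("P2", "77001", "482011000002"),
    ("P3", "77001", "482011000002"),
]

def group_by(records, key_index):
    """Group policy IDs by the field at key_index."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[0])
    return dict(groups)

by_zip = group_by(geocoded, 1)          # one bucket for all three policies
by_block_group = group_by(geocoded, 2)  # two buckets within the same zip code

print(by_zip)
print(by_block_group)
print(split_block_group_geoid("482011000001"))
```

All three policies share one zip code, yet they fall into two Block Groups, each of which could carry its own traffic, pollution, or hazard attributes.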
In 2009, JAMA Internal Medicine ran experiments revealing three groups of statistically significant variables applicable to health risk calculations. The table below lists them along with their levels of variance (R²).
This builds a case for a similar trial to identify "Geocodable" variables for modeling, not just within Life and Annuity, but across all disciplines of actuarial science. Moreover, the process of surfacing these hidden attributes has to be simple and scalable enough for actuaries to adopt without having to invest time and resources in technology. Let’s evaluate how a self-service framework can help.
Geo-Inclusive AI Framework for Actuarial Learning
A modern intelligent framework cannot be sustained without accounting for Economies of Scale. A study by IDC and IDG Enterprise shows that companies are experiencing exponential growth of data, led by Geolocation, with IoT coming in a close second. As more data continues to be collected, streams of significant attributes are expected to follow, with Geolocation, Wearables, and Socioeconomic data continuously dropped into Big Data.
With the massive affordability of Big Data (at one-twentieth the cost), insurance firms can hardly avoid extending their modern data architectures to be Geo-inclusive, and they stand to enjoy a low cost of failure with enterprise-wide experimentation. Note that Geocoding in Big Data has the potential to introduce hundreds of clean, ready-for-consumption variables, which requires actuaries to include Dimension Reduction, a common process in Machine Learning, to consolidate the number of random variables.
This leads us to envision an Information Rendering Framework (IRF) that can recommend relevant attributes. Mixing incumbent variables with Geocoded ones, IRF will apply Dimension Reduction techniques like Principal Component Analysis (PCA) and Correlation-based Feature Selection (CFS) to select the best collection of variables for training and evaluation. These variables can then be fed into Machine Learning techniques such as REPTree, Multi-Linear Regression, Artificial Neural Network, and Random Tree Classifiers within actuarial cycles to predict the risk of the applicant with greater accuracy and relevance.
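As a rough illustration of the feature selection step, the sketch below implements a greedy forward search using the CFS merit score, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-to-target correlation and r_ff is the mean feature-to-feature correlation. It is a simplified sketch, not the IRF itself: it assumes absolute Pearson correlation as the association measure, and the feature names and values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cfs_merit(features, target, subset):
    """CFS merit: rewards correlation with the target, penalizes redundancy."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = list(combinations(subset, 2))
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_select(features, target):
    """Greedily add the feature that most improves merit; stop when none does."""
    selected, best = [], 0.0
    remaining = set(features)
    while remaining:
        merit, name = max(
            (cfs_merit(features, target, selected + [f]), f) for f in sorted(remaining)
        )
        if merit <= best:
            break
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected, best

# Hypothetical geocoded variables and a hypothetical risk score.
risk_score = [1, 2, 3, 4, 5, 6]
features = {
    "traffic_density": [2, 4, 6, 8, 10, 12],  # tracks the target exactly
    "commute_minutes": [1, 2, 3, 4, 5, 7],    # nearly redundant with the above
    "distance_to_coast": [3, 1, 4, 1, 5, 2],  # weakly related
}

selected, merit = cfs_forward_select(features, risk_score)
print(selected, round(merit, 3))
```

In this toy data the near-duplicate variable is left out because it adds redundancy without adding predictive signal, which is the behavior that keeps hundreds of Geocoded variables manageable.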
A quantitative exercise on Kaggle by Noorhannah Boodhun and Manoj Jayabalan, on a dataset with 128 attributes volunteered by Prudential Life, revealed that the REPTree algorithm showed the highest performance with the CFS method, whereas Multi-Linear Regression showed the best performance with PCA.
With better classification and higher representation, IRF not only addresses the Adverse Selection issue but also checks results for bias and reduces the probability of overlooking ‘Geocodable’ attributes. To validate this, let’s study two cases quoted from the Institute and Faculty of Actuaries, where actuaries have successfully worked with Geocoded datasets using strategies similar to IRF.
Exposure Management
Supervised Machine Learning was successfully applied in Exposure Management as a data cleansing tool to predict missing "Geocodable" property attributes such as year built and number of floors. The results showed that Stochastic Gradient Boosting was the most accurate learning algorithm for both variables.
The Total Insured Value, or property value, was the most influential feature in predicting the number of floors of a building. The latitude and longitude of a property had the greatest influence in predicting the year a building was built. Among the other key findings, the ‘number of floors’ model had an accuracy, expressed as a Poisson deviance error, of 1 floor, and the "year built" model had an accuracy, expressed as a root mean squared error (RMSE), of 11.51 years. The report finds that Machine Learning is beneficial for finding definite patterns in training data and using them to fill in missing values.
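For readers unfamiliar with the two error measures, here is a minimal sketch of how RMSE and a mean Poisson deviance are commonly computed. The case study does not state exactly how its figures were derived, so the deviance formula below (the standard mean Poisson deviance) and the sample values are assumptions for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mean_poisson_deviance(actual, predicted):
    """Mean Poisson deviance for count targets; predictions must be positive."""
    total = 0.0
    for y, mu in zip(actual, predicted):
        if mu <= 0:
            raise ValueError("predicted counts must be strictly positive")
        # By convention y * log(y / mu) is taken as 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(actual)

# Hypothetical values: year built (continuous) and number of floors (counts).
years_actual, years_predicted = [1990, 2000], [1993, 1996]
floors_actual, floors_predicted = [1, 2, 3], [1.0, 2.0, 3.0]

print(rmse(years_actual, years_predicted))
print(mean_poisson_deviance(floors_actual, floors_predicted))
```

RMSE suits a continuous target such as year built, while a Poisson-based measure is a natural fit for a count such as the number of floors.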
Interest Rate Forecasting
Interest rate forecasting is important in various actuarial practice areas, such as Life Insurance, Asset Liability Management, Liabilities Valuation, and Capital Modeling. This case study describes a model that reads central bank communications and performs sentiment analysis on them. Central banks, like the Bank of England (BoE), exert vast influence on the level of interest rates via monetary policy. The tone or sentiment of central bank communications sets an expectation in the market.
Supervised Machine Learning techniques, such as Geocoding location expressions in Twitter messages, were used to train an ensemble model that classified BoE communications in a fully automated and scalable way. The results of the sentiment analysis were used in an interest-rate forecasting model. Given the inherent uncertainty in making forecasts, the interest-rate forecasting model provided a range of feasible outcomes.
Actuarial professionals are no strangers to Big Data and the analysis of unstructured datasets, which makes the adoption of IRF an easier curve to follow. Moreover, Geocoding and Machine Learning relieve actuaries of the requirement that all data be assigned to limited groups of pre-determined classifications.
Geocoding on Big Data opens up a broader range of analysis of how an insured entity relates to its location, demographic segment, and so on. While Big Data allows modelers to run assumptions and subsequent computations on hundreds of millions of views based on multiple classifications, Machine Learning helps automate complex attribute validation involving far more than just five variables.
Organizationally, actuaries will face several friction points when fitting a data science lifecycle within actuarial control cycles, because the former is not regulated while the latter is heavily regulated. Thus, an integrated path from strategy planning to production is critical for such initiatives to succeed.
To keep such projects from derailing, Vin Vashishta, Founder and Chief Data Scientist of V-Squared, suggests an organizational strategy of turning the data science team into a startup within the company so that incompatible parts of the process can be removed quickly. Instead of taking 1-2 years, the integration timeline drops to 3-6 months. This also brings clarity from an ROI standpoint, as revenue planning is now based on validated prototypes and not just hopeful projects. Companies adopting this strategy will be well placed to meet the next generation of challenges faced by the Insurance industry.