How Insurers Can Use Machine Learning Algorithms Like Logistic Regression
Logistic regression predicts the probability of an outcome based on predictor variables. A company, whose name is kept confidential, shared original data on engineering-related insurance; the Excel file is annexed at the end of this page. All data preprocessing steps were applied to this data before fitting a logistic regression model and other regression models (Ridge, Lasso, and ElasticNet), which improved model performance by addressing overfitting, handling multicollinearity, and performing feature selection.
Based on the logistic regression analysis below, performed in a Jupyter notebook (GitHub link shared at the end of this page), insurers can improve the following areas:
- Pricing: Use loss ratio trends and periods to set fair premiums.
- Risk Management: Identify treaties with high claims or poor profit ratios.
- Reserving: Outstanding and incurred values inform reserve estimation.
- Strategic Decisions: Understand which treaty periods (early/mid/late) are more risky or profitable.
Logistic regression in insurance analytics offers simplicity, interpretability, and accuracy. It enables faster decisions in underwriting, fraud detection, and customer retention, while supporting compliance and transparency through its clear probabilistic framework.
The following variables (features) were selected for the analysis: Re-takaful Contribution, Rebate Amount, Paid Losses, Outstanding, Incurred, Loss Ratio %, Profit Balance, Underwriting Rebate, Net Balance, and Profit Ratio %. Profit Ratio % is further classified into a new column named Profit_Class: each treaty is labeled profitable (1) or less profitable (0) based on a chosen threshold (here, 50%). A logistic regression model is trained on the data below, with Profit_Class as the target variable predicted from the features listed above. Insurers or takaful operators can use this model to examine how profitability changes if they vary the values of those features. The model predicts whether a contract is profitable, achieving Accuracy: 1.00 (perfect prediction on 6 test cases) and Precision, Recall, and F1 of 1.00. The caveat is that the small sample size risks overfitting, which is mitigated by applying regularization techniques (Ridge: R² = 0.62, Lasso: R² = 0.63, ElasticNet: R² = 0.69). ElasticNet performed best, balancing bias and variance.
Logistic regression is one technique insurers can use to predict profitability. It supports fast decisions on underwriting, claims, and pricing.

13 Essential Machine Learning Steps to Predict Insurance Profitability Using Logistic Regression
Step 1: Load the Dataset
This code loads the file titled engineering.xlsx into a DataFrame, as sketched below.
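A minimal sketch of this step with pandas (the notebook's actual cell may differ; the file name engineering.xlsx comes from the article):

```python
import pandas as pd

# Load the Excel workbook into a DataFrame
# (assumes engineering.xlsx sits in the working directory)
df = pd.read_excel("engineering.xlsx")

# Quick sanity check: dimensions and first rows
print(df.shape)
print(df.head())
```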
Step 2: Univariate Analysis (Single Variable at a Time)
- This code shows the mean, min, and max for each column and draws histograms of each data distribution (see the sketch below).
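A sketch of how these summary statistics and histograms are typically produced, assuming pandas and matplotlib:

```python
import matplotlib.pyplot as plt

# Count, mean, std, min, quartiles, and max for every numeric column
print(df.describe())

# One histogram per numeric column to inspect each distribution
df.hist(figsize=(12, 10), bins=20)
plt.tight_layout()
plt.show()
```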
1. Retakaful Contribution
- Purpose: This is the amount paid by the insurance company to the retakaful (reinsurance) provider for risk coverage.
- Observations (n): 28
- Mean: 6,505,841 – On average, companies contributed about 6.5 million.
- Standard Deviation: 11,195,250 – Contributions vary widely; some contracts are much larger.
- Minimum: 292,952 – The smallest contribution made.
- 25th Percentile (Q1): 2,214,000 – 25% of contributions are below this amount.
- Median (Q2): 3,581,712 – Half of the contributions are less than about 3.6 million.
- 75th Percentile (Q3): 6,597,710 – 75% of contributions are below this value.
- Maximum: 60,721,330 – One very large contract paid over 60 million.
2. Rebate Amount
- Purpose: The refund or reward the insurer receives based on favorable loss outcomes.
- Observations (n): 28
- Mean: 1,144,158 – Average rebate received was just over 1.1 million.
- Standard Deviation: 1,958,221 – Indicates considerable variation among cases.
- Minimum: 58,590 – Smallest rebate received.
- 25th Percentile: 428,817 – A quarter of contracts received less than this amount.
- Median: 615,394 – Half received less than this; the rest more.
- 75th Percentile: 1,182,795 – 75% of rebates are under this value.
- Maximum: 10,678,790 – Highest rebate, from a high-performing or low-claim contract.
3. Paid Losses
- Purpose: The total amount paid out by the retakaful provider to cover claims.
- Observations (n): 28
- Mean: 4,413,668 – On average, 4.4 million was paid per contract.
- Standard Deviation: 8,531,315 – Shows some contracts incurred very high claims.
- Minimum: 0 – Some contracts had no claims.
- 25th Percentile: 438,758 – 25% of the contracts paid out less than this.
- Median: 1,110,052 – Half of the contracts paid out less than this.
- 75th Percentile: 5,261,579 – 75% of paid losses fall under this value.
- Maximum: 41,160,520 – One contract had an exceptionally high loss.
4. Losses %
- Purpose: The proportion of losses compared to the contribution (how much was lost for every unit contributed).
- Observations (n): 28
- Mean: 98.5% – On average, nearly the entire contribution amount was lost, indicating marginal profitability.
- Standard Deviation: 134.36 – Suggests wide variability in contract performance.
- Minimum: 0% – Indicates some contracts experienced no loss.
- 25th Percentile: 7.5% – 25% of contracts had very low loss ratios.
- Median: 29.5% – Half the contracts had less than 30% loss, which is acceptable.
- 75th Percentile: 186.75% – 25% of contracts lost more than 186% of what they received, indicating large underwriting losses.
- Maximum: 560% – One contract paid out more than five times the contribution amount, a significant financial concern.
5. Outstanding
- Purpose: Claims reported but not yet paid by the retakaful provider (reserve).
- Observations (n): 8 (fewer entries, likely due to missing data).
- Mean: 510,639 – On average, about half a million is pending.
- Standard Deviation: 433,197 – Moderate variation across contracts.
- Min–Max Range: 15,191 to 1,380,076 – Some contracts had minimal pending amounts, others significantly more.
6. Incurred
- Purpose: Total of paid + outstanding losses (the true total cost of claims).
- Observations (n): 28
- Mean: 4,685,160 – Average cost per contract.
- Standard Deviation: 8,673,657 – High variation, indicating some very costly contracts.
- Min–Max Range: 0 to 42,470,520 – A few contracts incurred exceptionally large claims.
7. Loss Ratio %
- Purpose: Ratio of incurred losses to retakaful contribution (a measure of underwriting performance).
- Observations (n): 28
- Mean: 100.5% – On average, losses slightly exceeded contributions.
- Standard Deviation: 133.05 – Significant variability in contract performance.
- Min–Max Range: 0% to 560% – Some contracts yielded no losses, others faced extreme losses.
8. Profit Balance
- Purpose: Net financial result after accounting for claims and rebates.
- Observations (n): 28
- Mean: 476,480 – On average, a small profit was earned.
- Standard Deviation: 5,452,523 – Very large fluctuations; some contracts lost or earned millions.
- Minimum: -14,544,950 – Worst loss recorded.
- Maximum: 11,025,520 – Highest profit recorded.
9. Underwriting Rebate Net Balance
- Purpose: Financial balance after applying underwriting rebate adjustments.
- Identical to Profit Balance: This column duplicates the same values, indicating either a naming duplication or that all profits stemmed from underwriting.
- Note: Interpretation is the same as for Profit Balance.
10. Profit Ratio %
- Purpose: Percentage profit or loss relative to retakaful contribution.
- Observations (n): 28
- Mean: -23.64% – Indicates an average loss.
- Standard Deviation: 131.55 – Large variability.
- Min–Max Range: -468% to 84% – Some contracts lost nearly 5 times their contribution, while others made solid profits.
Step 3: Bivariate Analysis (Relationships Between Variables)
- This code visualizes the correlations between pairs of variables; a sketch follows.
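A sketch of a typical bivariate pass, assuming seaborn; the notebook's exact plots may differ:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of pairwise correlations between the numeric columns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between variables")
plt.show()

# Pairwise scatter plots for direct two-variable relationships
sns.pairplot(df.select_dtypes("number"))
plt.show()
```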
Step 4: Handling Missing Data
- This code fills blank values with the most frequent value for categorical columns and the average for numerical columns, as sketched below.
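One common way to implement this imputation rule, sketched here under the assumption that categorical columns have dtype object:

```python
# Impute: most frequent value for categorical columns, mean for numeric ones
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])  # most frequent value
    else:
        df[col] = df[col].fillna(df[col].mean())     # column average
```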
Step 5: Handling Outliers
- This code identifies extreme values in numeric columns and caps them to avoid distorting the analysis; a sketch follows.
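A sketch using the IQR rule, one standard way to cap outliers (the notebook's exact method may differ):

```python
# Clip each numeric column to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
for col in df.select_dtypes("number").columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```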
Step 6: Ordinal Encoding
- This code converts ordered categories (like early/mid/late treaty periods) into numbers, as sketched after this list:
- Early period: 2014–2016 (1st–2nd rows)
- Mid period: 2017–2019 (3rd–5th rows)
- Late period: 2020–2023 (6th–9th rows)
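A sketch with scikit-learn's OrdinalEncoder; the column name "Period" and its label values are assumptions, not confirmed by the notebook:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical "Period" column holding Early/Mid/Late labels;
# the explicit category order maps Early -> 0, Mid -> 1, Late -> 2
encoder = OrdinalEncoder(categories=[["Early", "Mid", "Late"]])
df["Period_Encoded"] = encoder.fit_transform(df[["Period"]]).ravel()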
Step 7: One-Hot Encoding
- This code converts categories without an inherent order into binary columns; for example, “Quota Share” becomes 1 or 0. A sketch follows.
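A sketch with pandas get_dummies; the column name "Treaty Type" is hypothetical:

```python
import pandas as pd

# One-hot encode an unordered category; "Treaty Type" is a hypothetical
# column name for values like "Quota Share"
df = pd.get_dummies(df, columns=["Treaty Type"], dtype=int)
# A new column such as "Treaty Type_Quota Share" now holds 1 or 0 per row
```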
Step 8: Scaling and Normalization
- StandardScaler centers variables (mean = 0).
- MinMaxScaler rescales them to between 0 and 1.
- This keeps different measurements (e.g., income, years) comparable; see the sketch below.
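A sketch of both scalers from scikit-learn, applied to the numeric columns:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

num_cols = df.select_dtypes("number").columns

# StandardScaler: mean 0, unit variance per column
df_standard = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]),
                           columns=num_cols)

# MinMaxScaler: rescale each column into the [0, 1] range
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df[num_cols]),
                         columns=num_cols)
```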
Step 9: Power Transformation
- This code makes the data more “normal” (bell-shaped) to improve modeling accuracy. Power transformations do this by reducing skewness, stabilizing variance, and making the data symmetric, thereby improving model performance; a sketch follows.
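A sketch using scikit-learn's PowerTransformer; Yeo-Johnson is assumed here because the data contains zeros and negative profit ratios, which Box-Cox cannot handle:

```python
from sklearn.preprocessing import PowerTransformer

num_cols = df.select_dtypes("number").columns

# Yeo-Johnson reduces skewness and stabilizes variance, and unlike
# Box-Cox it accepts zero and negative values
pt = PowerTransformer(method="yeo-johnson")
df[num_cols] = pt.fit_transform(df[num_cols])
```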
Step 10: Binarization
- This code assigns 1 for high profit (above 50%) and 0 for low profit.
- Binarization converts continuous numeric data into binary (0 or 1) categories. In this case, "Profit Ratio %" is transformed into a new column, "Profit_Class".
- This helps classify treaties as either profitable (1) or less profitable (0) based on a chosen threshold (here, 50%); see the sketch below.
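A minimal sketch of the binarization; the 50% threshold comes from the article, while the strict comparison direction is an assumption:

```python
# 1 = profitable (Profit Ratio % above the 50% threshold), 0 = less profitable
df["Profit_Class"] = (df["Profit Ratio %"] > 50).astype(int)
```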
Step 11: Principal Component Analysis (PCA)
- PCA helps select the most important features (dimensions) for better performance; a sketch follows.
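A sketch with scikit-learn's PCA; the 95% variance threshold is an assumption, and PCA is run on the standardized data from Step 8:

```python
from sklearn.decomposition import PCA

# Keep the smallest number of components explaining ~95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(df_standard)  # standardized features from Step 8

print(pca.explained_variance_ratio_)    # variance captured per component
```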
Step 12: Logistic Regression
- This code divides the data into features (X) and target (y), then splits it into training (80%) and testing (20%) sets.
- It trains a model to predict the profit class (high or low).
- It evaluates the predictions using standard metrics: accuracy, precision, recall, etc.
- Logistic Regression is a classification algorithm used to predict a binary outcome. In this case, 0 = low profit and 1 = high profit. The model learns from input features (like losses and rebates) to estimate the probability of achieving a high profit. A sketch of the whole step follows.
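A sketch of the full step with scikit-learn; the exact feature set and random_state are assumptions (with 28 rows, a 20% split leaves 6 test observations, matching the report):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Target and features; "Profit Ratio %" is dropped from X because
# Profit_Class is derived from it (avoids leaking the target)
y = df["Profit_Class"]
X = df.drop(columns=["Profit_Class", "Profit Ratio %"])

# 80/20 split: 28 rows leave 6 observations for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```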
The logistic regression model was applied to classify profit levels into high (1) and low (0) based on a threshold. The confusion matrix and performance metrics show:
- Perfect prediction accuracy (100%) on the test set (6 observations).
- 4 cases of low profit and 2 cases of high profit were all correctly classified.
- Precision, recall, and F1-score are all 1.00, indicating no false positives or negatives.
This suggests that the logistic model performed exceptionally well on the current data. However, caution is needed due to the small test size, which may lead to overfitting or optimistic accuracy.
Accuracy: 1.00 (6)
- Definition: Accuracy is the ratio of correctly predicted instances to total instances.
- Interpretation: Out of 6 total observations, all were predicted correctly; 1.00 = 100% accuracy, meaning no misclassifications.
Macro avg: 1.00 (6)
- Definition: Macro average takes the mean of precision, recall, and F1-score across classes, treating all classes equally (regardless of how many samples are in each).
- Interpretation: Precision (avg): 1.00, Recall (avg): 1.00, F1-score (avg): 1.00. This means both classes (0 and 1) were predicted perfectly.
Weighted avg: 1.00 (6)
- Definition: Weighted average takes into account the number of instances in each class, providing a more realistic average when class sizes are imbalanced.
- Interpretation: Even after adjusting for the number of low-profit (4) and high-profit (2) cases, precision, recall, and F1-score all remain perfect at 1.00.
Step 13: Ridge, Lasso, and ElasticNet Regression
- The following three models are applied to address overfitting (fitting noise rather than real patterns):
- Ridge lowers the weights using a squared penalty,
- Lasso removes irrelevant features,
- ElasticNet mixes both methods.
- The code trains each model and measures how well it explains the variation in the profit ratio; a sketch follows the results below.
Three linear regression models with regularization were used to predict the continuous profit ratio (%). Their performance (R² scores) was:
- Ridge Regression: 0.62
- Lasso Regression: 0.63
- ElasticNet Regression: 0.69
These results show that the models explain 62% to 69% of the variance in the profit ratio, with ElasticNet performing best. This indicates moderate to good predictive ability, capturing the main trends in the data while controlling for overfitting.
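A sketch of the three regularized regressions with scikit-learn; the alpha values and the feature/target split are assumptions:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Continuous target this time: Profit Ratio % (column names assumed)
y = df["Profit Ratio %"]
X = df.drop(columns=["Profit Ratio %", "Profit_Class"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Ridge shrinks weights (L2), Lasso zeroes some out (L1), ElasticNet mixes both
for name, reg in [("Ridge", Ridge(alpha=1.0)),
                  ("Lasso", Lasso(alpha=1.0, max_iter=10000)),
                  ("ElasticNet", ElasticNet(alpha=1.0, l1_ratio=0.5,
                                            max_iter=10000))]:
    reg.fit(X_train, y_train)
    print(name, "R^2:", round(r2_score(y_test, reg.predict(X_test)), 2))
```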
You can view the code and output for the engineering-related insurance data at this GitHub link:
https://github.com/imran345/Machine-Learning-Models-on-Insurance/blob/main/Untitled52.ipynb
You can view the original Excel data on engineering-related insurance at this link: