Insurance Prediction (Linear Regression) using PyTorch
Medical insurance prediction using linear regression
In this notebook, we explore the medical insurance dataset and train a linear regression model on it using PyTorch.
The dataset is available at https://www.kaggle.com/mirichoi0218/insurance
import torch
import torchvision
import torch.nn as nn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split
df = pd.read_csv("./input/insurance.csv")
# First five rows
df.head()
# Last five rows
df.tail()
# Statistical information for our dataset
df.describe()
What can we infer from descriptive stats?
The average customer age is about 39 years, with a maximum of 64 years. Customers have one child on average, with a minimum of none and a maximum of 5. Three quarters of the observations are aged 51 or younger and have at most 2 children. The average insurance charge is 13270.42 units, and 75% of observations are at or below 16639.91 units.
# Data types for all columns
df.info()
# See if there are any missing values
print(df.isnull().sum())
Let's visualize the target variable - 'expenses' first and see its distribution using histogram.
plt.figure(figsize=(6,6))
plt.hist(df.expenses, bins = 'auto', color = 'purple')
plt.xlabel("expenses")
plt.title("Distribution of expenses")
What can we infer from data distribution?
- Most of the expenses are between 100 and 10000
- Very few people are charged above 50000
- The mean is 13270 (from the descriptive stats) and most of the data lies to the left of it, so the distribution is right-skewed
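One way to check this kind of skew numerically, rather than by eye, is to compare the mean with the median and look at the sample skewness. The sketch below uses a small made-up series (`toy_expenses` is hypothetical, not the insurance data) purely to show the calls; the same two lines could be applied to `df.expenses`.

```python
import pandas as pd

# Hypothetical values: mostly small, with one large outlier on the right
toy_expenses = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 30.0])

mean_value = toy_expenses.mean()
median_value = toy_expenses.median()
skewness = toy_expenses.skew()  # positive for a long right tail

print("mean:", mean_value)
print("median:", median_value)
print("skewness:", skewness)
```

A mean well above the median together with a positive skewness is the usual signature of a long right tail.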
Let's now plot the outliers - expenses vs other variables
cols = ['age', 'children', 'sex', 'smoker', 'region']
for col in cols:
    plt.figure(figsize=(8,8))
    sns.boxplot(x = df[col], y = df['expenses'])
What can we infer from box-plots?
- As age increases, the insurance cost increases i.e. younger people have less cost
- Customers with 2 children are charged more
- Being male or female has little visible impact on cost, but males show a wider cost range
- Smokers are charged more
- Region does not show much correlation with charges, though the south-east region has a larger range, up to about 20,000, in its distribution of customer charges.
In our dataset, we have three qualitative variables: sex, smoker and region. We can convert these to quantitative variables so the model can use them. Sex and smoker can simply be assigned a binary value since they take only two values. Region takes more than two values and can be converted using pandas' get_dummies function.
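As a rough illustration of the two encodings described above, here is a minimal sketch on a made-up three-row frame (`toy` is hypothetical and only mirrors the column names of the dataset). It assumes that one indicator column per region is the desired result.

```python
import pandas as pd

# Hypothetical frame with the same categorical columns as the dataset
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southwest"],
})

# Two-valued columns map directly to 0/1
toy["smoker"] = (toy["smoker"] == "yes").astype(int)
toy["sex"] = (toy["sex"] == "male").astype(int)

# A multi-valued column becomes one indicator column per category
toy = pd.get_dummies(toy, columns=["region"], dtype=int)

print(toy.columns.tolist())
print(toy)
```

Note that get_dummies returns several columns, so its result replaces the original region column rather than being assigned back into it.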
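# One-hot encode region. get_dummies returns one column per category,
# so it replaces the region column instead of being assigned into it.
df = pd.get_dummies(df, columns=['region'], dtype=int)
df.head()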
df.smoker
df.smoker = [1 if x == 'yes' else 0 for x in df.smoker]
df.sex = [1 if x == 'male' else 0 for x in df.sex]
df.smoker
df.expenses = pd.to_numeric(df.expenses)
# Create Correlation matrix for all features of data.
df.corr()
# Generate heatmap to visualize strong & weak correlations.
sns.heatmap(df.corr().round(2), square=True, cmap='RdYlGn', annot=True)
From the heatmap, we can infer that expenses and smoker have the highest correlation, while expenses and region have the lowest.
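Reading the strongest and weakest correlations off a heatmap can be error-prone, so it may help to rank them directly. The sketch below does this on a small made-up frame (`toy` and its values are hypothetical); on the real data the same expression could be applied to `df`.

```python
import pandas as pd

# Hypothetical numeric frame; values are invented for illustration only
toy = pd.DataFrame({
    "age": [20, 60, 30, 50, 40, 45],
    "smoker": [0, 0, 1, 1, 0, 1],
    "expenses": [2000, 3000, 20000, 25000, 2500, 22000],
})

# Absolute correlation of every feature with the target, strongest first
ranked = (
    toy.corr()["expenses"]
    .drop("expenses")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```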
sns.pairplot(df)
targets_df = df.expenses
inputs_df = df.drop(['expenses'], axis=1)
inputs_df
# Convert the DataFrame to NumPy arrays
inputs_narray = inputs_df.to_numpy('float32')
target_narray = targets_df.to_numpy('float32')
# Create PyTorch tensors from the NumPy arrays
inputs = torch.from_numpy(inputs_narray)
# Reshape targets to (N, 1) so they match the model output;
# otherwise the loss silently broadcasts (N, 1) against (N,)
targets = torch.from_numpy(target_narray).unsqueeze(1)
inputs.dtype, targets.dtype
num_rows = inputs.shape[0]
Next, we need to create PyTorch datasets & data loaders for training & validation. We'll start by creating a TensorDataset.
dataset = TensorDataset(inputs, targets)
Q: Pick a number between 0.1 and 0.2 to determine the fraction of data that will be used for creating the validation set. Then use random_split to create training & validation datasets.
val_percent = 0.10 # between 0.1 and 0.2
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size
train_ds, val_ds = random_split(dataset, [train_size , val_size ])
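The split above changes on every run, which makes validation losses hard to compare between sessions. A possible variant, sketched here on a made-up 20-row dataset (`toy_ds` is hypothetical), passes a seeded generator to random_split so the same rows land in the validation set each time. The seed value 42 is an arbitrary choice.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical dataset with 20 rows, one feature and one target each
features = torch.arange(20, dtype=torch.float32).unsqueeze(1)
labels = 2 * features
toy_ds = TensorDataset(features, labels)

val_size = int(len(toy_ds) * 0.1)
train_size = len(toy_ds) - val_size

# A seeded generator makes the random split repeatable
first_train, first_val = random_split(
    toy_ds, [train_size, val_size],
    generator=torch.Generator().manual_seed(42),
)
second_train, second_val = random_split(
    toy_ds, [train_size, val_size],
    generator=torch.Generator().manual_seed(42),
)

print(len(first_train), len(first_val))
print(first_val.indices == second_val.indices)
```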
Finally, we can create data loaders for training & validation.
batch_size = 64
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
Let's look at a batch of data to verify everything is working fine so far.
for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break
Hint: Think carefully about picking a good loss function (it's not cross entropy). Maybe try 2-3 of them and see which one works best. See https://pytorch.org/docs/stable/nn.functional.html#loss-functions
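To get a feel for how the candidate regression losses differ, the sketch below evaluates three of them on the same pair of made-up predictions and targets (the tensors are hypothetical). One prediction is off by 2.0 and the other by 0.5, which shows how each loss weighs a large error against a small one.

```python
import torch
import torch.nn.functional as F

# Hypothetical predictions and targets, both shaped (batch, 1)
pred = torch.tensor([[2.0], [0.0]])
target = torch.tensor([[0.0], [0.5]])

mse = F.mse_loss(pred, target)          # mean of squared errors
mae = F.l1_loss(pred, target)           # mean of absolute errors
huber = F.smooth_l1_loss(pred, target)  # quadratic below 1, linear above

print("mse_loss:      ", mse.item())    # 2.125
print("l1_loss:       ", mae.item())    # 1.25
print("smooth_l1_loss:", huber.item())  # 0.8125
```

The squared error is dominated by the single large miss, while the L1 and smooth L1 losses grow only linearly with it, which tends to matter for a target with a long right tail.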
input_size = len(inputs_df.columns)
output_size = 1
class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        out = self.linear(xb)
        return out

    def training_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.smooth_l1_loss(out, targets)
        return loss

    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.smooth_l1_loss(out, targets)
        return {'val_loss': loss.detach()}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()  # Combine losses
        return {'val_loss': epoch_loss.item()}

    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 20th epoch and after the last epoch
        if (epoch+1) % 20 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))
Let us create a model using the InsuranceModel class. You may need to come back later and re-run the next cell to reinitialize the model, in case the loss becomes nan or infinity.
model = InsuranceModel()
Let's check out the weights and biases of the model using model.parameters.
list(model.parameters())
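For a single nn.Linear layer, the parameter list holds one weight matrix and one bias vector. The sketch below prints their shapes for a stand-alone layer; the feature count of 6 is an assumption for illustration and may differ from the notebook's input_size.

```python
import torch.nn as nn

# Stand-alone layer; 6 input features is an assumed value
layer = nn.Linear(6, 1)

# Weight has shape (out_features, in_features), bias has shape (out_features,)
shapes = [tuple(p.shape) for p in layer.parameters()]
print(shapes)  # [(1, 6), (1,)]
```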
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history
Use the evaluate function to calculate the loss on the validation set before training.
# Pass the data loader (batches), not the raw dataset (single samples)
result = evaluate(model, val_loader)
print(result)
We are now ready to train the model. You may need to run the training loop many times, for different numbers of epochs and with different learning rates, to get a good result. Also, if your loss becomes too large (or nan), you may have to re-initialize the model by running the cell model = InsuranceModel(). Experiment with this for a while, and try to get to as low a loss as possible.
Q: Train the model 4-5 times with different learning rates & for different number of epochs.
Hint: Vary learning rates by orders of 10 (e.g. 1e-2, 1e-3, 1e-4, 1e-5, 1e-6) to figure out what works.
epochs = 100
lr = 1e-6
history1 = fit(epochs, lr, model, train_loader, val_loader)
epochs = 100
lr = 1e-5
history2 = fit(epochs, lr, model, train_loader, val_loader)
epochs = 100
lr = 1e-4
history3 = fit(epochs, lr, model, train_loader, val_loader)
epochs = 100
lr = 1e-3
history4 = fit(epochs, lr, model, train_loader, val_loader)
epochs = 100
lr = 1e-2
history5 = fit(epochs, lr, model, train_loader, val_loader)
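The runs above keep training the same model, so each learning rate starts where the previous one stopped. To compare learning rates on equal footing, one option is to restart from the same initial weights for each rate. The sketch below shows that idea on synthetic data; the data, the `final_loss` helper, and the chosen rates are all hypothetical and unrelated to the insurance dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Synthetic data (assumption): y = 3x + 1 plus a little noise
torch.manual_seed(0)
x = torch.rand(200, 1)
y = 3 * x + 1 + 0.05 * torch.randn(200, 1)

def final_loss(lr, epochs=200):
    """Train a fresh linear model with plain SGD and return its last loss."""
    torch.manual_seed(1)  # same initial weights for every learning rate
    model = nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.smooth_l1_loss(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    with torch.no_grad():
        return F.smooth_l1_loss(model(x), y).item()

results = {lr: final_loss(lr) for lr in [1e-1, 1e-2, 1e-3]}
for lr, loss in results.items():
    print(f"lr={lr:g}  final loss={loss:.4f}")
```

On the real data, the useful range of learning rates will differ, since the raw features and targets are on a much larger scale than this toy example.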
val_loss = [result] + history1 + history2 + history3 + history4 + history5
val_loss_list = [vloss['val_loss'] for vloss in val_loss]
plt.plot(val_loss_list, '-x')
plt.title('Validation Loss vs. Number of epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
def predict_single(input, target, model):
    inputs = input.unsqueeze(0)
    predictions = model(inputs)
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)
input, target = val_ds[0]
predict_single(input, target, model)
input, target = val_ds[10]
predict_single(input, target, model)
input, target = val_ds[23]
predict_single(input, target, model)
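Checking a few individual predictions gives only a rough impression. A complementary check is the mean absolute error over a whole data loader, which is in the same units as the target. The helper below is a hypothetical sketch (`mean_absolute_error` is not part of the notebook), demonstrated on a tiny hand-built model so the result can be verified by hand; it could be called with the trained model and val_loader instead.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def mean_absolute_error(model, loader):
    """Average |prediction - target| over every sample in the loader."""
    total_abs_err, count = 0.0, 0
    with torch.no_grad():
        for xb, yb in loader:
            preds = model(xb)
            # view_as guards against (N,) targets vs (N, 1) predictions
            total_abs_err += (preds - yb.view_as(preds)).abs().sum().item()
            count += yb.numel()
    return total_abs_err / count

# Toy check: a layer that simply sums its two inputs
toy_model = nn.Linear(2, 1)
with torch.no_grad():
    toy_model.weight.fill_(1.0)
    toy_model.bias.fill_(0.0)

toy_x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
toy_y = torch.tensor([[2.0], [9.0]])
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=2)

toy_mae = mean_absolute_error(toy_model, toy_loader)
print(toy_mae)  # predictions 3 and 7 vs targets 2 and 9 -> 1.5
```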