Haberman Dataset Data Analysis and Visualization¶

About Haberman Dataset¶

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Haberman Dataset : https://www.kaggle.com/gilsousa/habermans-survival-data-set

Attribute Information:¶

Age of patient at time of operation (numerical)
Patient's year of operation (year - 1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute)
-- 1 = the patient survived 5 years or longer
-- 2 = the patient died within 5 years
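The encoding above can be made readable with a small helper. This is only a sketch: the names `STATUS_LABELS` and `decode_record` are made up for illustration, and the example values are not necessarily a row from the dataset.

```python
# Hypothetical helper names; only the encoding follows the attribute list above.
STATUS_LABELS = {
    1: "survived 5 years or longer",
    2: "died within 5 years",
}


def decode_record(age, year, nodes, status):
    """Turn one raw row into a readable dictionary."""
    return {
        "age": age,
        "operation_year": 1900 + year,  # the file stores year - 1900
        "positive_nodes": nodes,
        "outcome": STATUS_LABELS[status],
    }


# Illustrative values only.
example = decode_record(45, 64, 3, 2)
print(example)
```

Keeping the mapping in one dictionary means the labels used in printed summaries and plot legends stay consistent.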

Research Question¶

Our objective is to find the attributes that affect survival status.

Does age have an effect on survival status?
Does the number of detected axillary nodes have an effect on post-op 5-year mortality?

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#Load haberman.csv into a pandas DataFrame.
#The dataset has no header row, so the column names are supplied while reading the file
haberman = pd.read_csv("haberman.csv",names = ["PatientAge", "YearOfOperation", 
                                               "AuxNodes", "SurvivalStatus"])

## (Q) How many data points and features are there?
print('Number of rows and columns :' , haberman.shape)

## Print first 5 rows
haberman.head(5)

Number of rows and columns : (306, 4)

Data Cleaning¶

#Does any feature have null values?
haberman.isnull().any()

#The data looks clean: there are no null values

PatientAge         False
YearOfOperation    False
AuxNodes           False
SurvivalStatus     False
dtype: bool

Find Outliers¶

IQR = Q3 - Q1
Outlier < Q1 - 1.5*IQR
Outlier > Q3 + 1.5*IQR
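The rule above can be sketched with the standard library alone. The function name `iqr_fences` and the sample values are assumptions for illustration; `method="inclusive"` interpolates linearly between data points, which as far as I know matches the default behaviour of the pandas `quantile` call used later in this post.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample with one obviously large value.
sample = [0, 0, 1, 2, 3, 4, 5, 30]
low, high = iqr_fences(sample)
outliers = [v for v in sample if v < low or v > high]

print("fences:", low, high)
print("outliers:", outliers)
```

Values exactly equal to a fence are not outliers under this rule, because the comparisons are strict.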

Removing outliers is one of the steps in data cleaning. To keep things simple, we skip this step for now. At the end we will see how to remove outliers.

Understand the data using descriptive statistics¶

The first step is to understand the features of the dataset.

Descriptive statistics describe and summarize data.

Measures of central tendency
- Mean
- Mode
- Median
Measures of dispersion (also called variability, scatter, or spread)
- Variance
- Standard deviation
- Range
- Interquartile range (IQR)
- Median absolute deviation (MAD)
Shapes of Distributions
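As a rough illustration of the measures listed above, here is how they could be computed for a small made-up sample with Python's `statistics` module. The sample and the `summary` dictionary are assumptions for the example; the post itself relies on pandas for the real data.

```python
import statistics as st

# Hypothetical sample, not taken from the dataset.
sample = [1, 2, 2, 3, 4, 7, 9]

q1, _, q3 = st.quantiles(sample, n=4, method="inclusive")

summary = {
    # Measures of central tendency
    "mean": st.mean(sample),
    "median": st.median(sample),
    "mode": st.mode(sample),
    # Measures of dispersion
    "variance": st.variance(sample),  # sample variance (n - 1 denominator)
    "std": st.stdev(sample),          # sample standard deviation
    "range": max(sample) - min(sample),
    "iqr": q3 - q1,
}

for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

The mean (4) is larger than the median (3) here, which hints at a right-skewed sample, the same kind of shape the node counts show later on.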

#Generate descriptive statistics that summarize the 
#central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values
haberman.describe()

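#find median
print('Median')
print(haberman.median())

#find MAD (here the mean absolute deviation, which is what DataFrame.mad() computed;
#that method was removed in pandas 2.0, so it is computed from its definition)
print ('MAD')
print((haberman - haberman.mean()).abs().mean())

Median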
PatientAge         52.0
YearOfOperation    63.0
AuxNodes            1.0
SurvivalStatus      1.0
dtype: float64
MAD
PatientAge         8.865180
YearOfOperation    2.787005
AuxNodes           4.790935
SurvivalStatus     0.389273
dtype: float64
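The abbreviation MAD is used for two different statistics: the mean absolute deviation around the mean, and the median absolute deviation around the median. The sketch below contrasts the two on a made-up sample; the function names are hypothetical.

```python
from statistics import mean, median


def mean_abs_deviation(values):
    """Average distance of the values from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


def median_abs_deviation(values):
    """Median distance of the values from their median."""
    center = median(values)
    return median([abs(v - center) for v in values])


# Hypothetical sample with one extreme value.
sample = [0, 0, 1, 1, 2, 3, 28]

mean_ad = mean_abs_deviation(sample)
median_ad = median_abs_deviation(sample)

print("mean absolute deviation  :", mean_ad)
print("median absolute deviation:", median_ad)
```

The single extreme value pulls the mean-based figure up sharply while the median-based one barely moves, which is why the median version is often preferred for skewed data.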

# Q. How many people survived >= 5 years?
print(haberman['SurvivalStatus'].value_counts())

# Q. What proportion of people survived >= 5 years?
haberman['SurvivalStatus'].value_counts(normalize=True)

#We can see that about 73.5% of the patients survived 5 years or longer

1    225
2     81
Name: SurvivalStatus, dtype: int64

1    0.735294
2    0.264706
Name: SurvivalStatus, dtype: float64
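The proportions are simply each count divided by the total. A minimal sketch that rebuilds them from the counts printed above (225 and 81), without needing the CSV file:

```python
from collections import Counter

# Rebuild the class column from the counts reported above.
statuses = [1] * 225 + [2] * 81

counts = Counter(statuses)
total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

for label, share in proportions.items():
    print(label, round(share, 6))
```

The two classes are unbalanced, roughly three survivors for every patient who died within 5 years, which is worth remembering when comparing the groups.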

Let's divide the data into two parts based on survival status¶

df_survived = haberman[haberman['SurvivalStatus'] == 1]
df_died = haberman[haberman['SurvivalStatus'] == 2]

print('The patient survived >= 5 years') 
print(df_survived.describe())

print('The patient died within 5 years') 
print(df_died.describe())

The patient survived >= 5 years
       PatientAge  YearOfOperation    AuxNodes  SurvivalStatus
count  225.000000       225.000000  225.000000           225.0
mean    52.017778        62.862222    2.791111             1.0
std     11.012154         3.222915    5.870318             0.0
min     30.000000        58.000000    0.000000             1.0
25%     43.000000        60.000000    0.000000             1.0
50%     52.000000        63.000000    0.000000             1.0
75%     60.000000        66.000000    3.000000             1.0
max     77.000000        69.000000   46.000000             1.0
The patient died within 5 years
       PatientAge  YearOfOperation   AuxNodes  SurvivalStatus
count   81.000000        81.000000  81.000000            81.0
mean    53.679012        62.827160   7.456790             2.0
std     10.167137         3.342118   9.185654             0.0
min     34.000000        58.000000   0.000000             2.0
25%     46.000000        59.000000   1.000000             2.0
50%     53.000000        63.000000   4.000000             2.0
75%     61.000000        65.000000  11.000000             2.0
max     83.000000        69.000000  52.000000             2.0

Points to observe¶

AuxNodes for patients who survived 5 years or longer

* Mean =  2.79 
* 75%  = 3.0

AuxNodes for patients who died within 5 years

* Mean =  7.46 
* 75%  = 11.0

The standard deviation of AuxNodes is also higher in the second group (9.19 versus 5.87).

The 75% value, or third quartile (Q3), is the value below which 75% of the observations fall.

We can say that patients who died within 5 years had more positive axillary nodes.
There is not much difference in PatientAge between the two groups (mean 52.0 versus 53.7).
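One way to put a number on "more nodes" and "not much difference in age" is a standardized mean difference (Cohen's d). This measure is not part of the original analysis; it is added here only as a sketch, using the means, standard deviations and counts from the two `describe()` outputs above. The function name `cohens_d` is hypothetical.

```python
from math import sqrt


def cohens_d(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Difference of group means divided by the pooled standard deviation."""
    pooled_var = ((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / sqrt(pooled_var)


# Group a = survived 5 years or longer, group b = died within 5 years.
d_nodes = cohens_d(2.791111, 5.870318, 225, 7.456790, 9.185654, 81)
d_age = cohens_d(52.017778, 11.012154, 225, 53.679012, 10.167137, 81)

print("AuxNodes  :", round(d_nodes, 2))
print("PatientAge:", round(d_age, 2))
```

The value for AuxNodes comes out several times larger than the one for PatientAge, which agrees with the observations above. Because the node counts are strongly skewed, this should be read as a rough indication rather than a formal test.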

Data Visualization¶

Let's analyse the data using graphs.

#1. Plot all features
haberman.plot()
plt.show()
#We cannot make much sense out of it.

# histograms
haberman.hist()
plt.show()

# box plots are useful for identifying outliers
haberman.plot(kind='box')
plt.show()

#We can see some outliers: the points beyond the whiskers

#Let's analyse the data using a pairplot with respect to SurvivalStatus
#(height replaces the older size argument, which newer seaborn versions no longer accept)
sns.pairplot(haberman, hue="SurvivalStatus", height=3, diag_kind="kde")
plt.show()

Points to Observe¶

PatientAge is more or less the same in both groups.
AuxNodes looks different. We already know that the patients who died within 5 years had more positive axillary nodes.

#Seaborn plot of the PDF of AuxNodes.
#(height replaces the older size argument)
sns.FacetGrid(haberman, hue="SurvivalStatus", height=10) \
   .map(sns.kdeplot, "AuxNodes") \
   .add_legend()
plt.show()

Points to observe¶

You can observe a long tail to the right for both groups. Let's look at the same graph after removing the outliers.

Remove the outliers from AuxNodes and draw again.¶

#box plot 
sns.boxplot(x='SurvivalStatus',y='AuxNodes', data=haberman)
plt.show()

#find IQR = Q3 - Q1
#print('Q3 \n' ,haberman.quantile(0.75))
#print('Q1 \n ' ,haberman.quantile(0.25))

df_iqr = haberman.quantile(0.75) - haberman.quantile(0.25)
print('IQR \n',df_iqr)

aux_low_iqr = haberman['AuxNodes'].quantile(0.25) - (df_iqr['AuxNodes'] * 1.5)
aux_high_iqr = haberman['AuxNodes'].quantile(0.75) + (df_iqr['AuxNodes'] * 1.5)

print('Outlier < Q1 - 1.5*IQR \n ',aux_low_iqr )
print('Outlier > Q3 + 1.5*IQR \n ', aux_high_iqr)


#Keep only the rows whose AuxNodes value lies strictly inside the fences
df_haberman_cleaned = haberman[(haberman['AuxNodes'] > aux_low_iqr) & \
                               (haberman['AuxNodes'] < aux_high_iqr)]
print('After removing outliers',df_haberman_cleaned.shape)

#Outliers in AuxNodes
df_removed = haberman[(haberman['AuxNodes'] < aux_low_iqr) | \
                      (haberman['AuxNodes'] > aux_high_iqr)]
print('Removed data shape',df_removed.shape)

#40 rows are flagged as outliers. The cleaned dataset has 263 rows because the
#strict comparisons above also drop the 3 rows whose AuxNodes equals the upper fence.
df_removed

IQR 
 PatientAge         16.75
YearOfOperation     5.75
AuxNodes            4.00
SurvivalStatus      1.00
dtype: float64
Outlier < Q1 - 1.5*IQR 
  -6.0
Outlier > Q3 + 1.5*IQR 
  10.0
After removing outliers (263, 4)
Removed data shape (40, 4)
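Note that 263 + 40 = 303, three fewer than the 306 rows in the dataset. Both filters use strict comparisons, so a row whose AuxNodes value is exactly equal to a fence lands in neither result. The sketch below shows the effect on a made-up list, using the fence values printed above; the variable names are hypothetical.

```python
low, high = -6.0, 10.0  # fences printed above

# Hypothetical node counts, two of them exactly on the upper fence.
sample = [0, 3, 10, 10, 11, 52]

strict_inside = [v for v in sample if low < v < high]
outliers = [v for v in sample if v < low or v > high]
inclusive_inside = [v for v in sample if low <= v <= high]

print("strict inside   :", strict_inside)
print("outliers        :", outliers)
print("inclusive inside:", inclusive_inside)
```

With inclusive comparisons (`>=` and `<=`) for the kept rows, the kept rows and the outliers together add up to the full dataset again.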

#Seaborn plot of the PDF of AuxNodes.
#(height replaces the older size argument)
sns.FacetGrid(df_haberman_cleaned, hue="SurvivalStatus", height=10) \
   .map(sns.kdeplot, "AuxNodes") \
   .add_legend()
plt.show()

Now the graph is easier to read, and you can observe that the area under the green line is larger than the area under the blue line.

Conclusion¶

Does age have an effect on survival status? No
Does the number of detected axillary nodes have an effect on post-op 5-year mortality? Yes

Next Objective¶

We know that the survival status depends on the number of detected axillary nodes.

But we are more interested in a question like the one below.

Given the number of detected axillary nodes and past results, can we predict the survival status?

We will see how to answer the above question in an upcoming article.


Output of haberman.describe():
	PatientAge	YearOfOperation	AuxNodes	SurvivalStatus
count	306.000000	306.000000	306.000000	306.000000
mean	52.457516	62.852941	4.026144	1.264706
std	10.803452	3.249405	7.189654	0.441899
min	30.000000	58.000000	0.000000	1.000000
25%	44.000000	60.000000	0.000000	1.000000
50%	52.000000	63.000000	1.000000	1.000000
75%	60.750000	65.750000	4.000000	2.000000
max	83.000000	69.000000	52.000000	2.000000

Removed outlier rows (df_removed):
	PatientAge	YearOfOperation	AuxNodes	SurvivalStatus
9	34	58	30	1
14	35	64	13	1
22	37	60	15	1
24	38	69	21	2
31	38	66	11	1
43	41	60	23	2
59	42	62	20	1
62	43	58	52	2
66	43	63	14	1
75	44	63	19	2
79	44	67	16	1
85	45	59	14	1
92	46	65	20	2
96	47	63	23	2
106	47	66	12	1
107	48	58	11	2
108	48	58	11	2
124	50	63	13	2
136	51	59	13	2
160	53	63	24	2
161	53	65	12	2
167	54	60	11	2
168	54	65	23	2
174	54	67	46	1
177	54	63	19	1
181	55	68	15	2
185	55	66	18	1
188	55	69	22	1
198	57	62	14	2
215	59	62	35	2
223	60	59	17	2
227	60	61	25	1
238	62	59	13	2
240	62	65	19	2
252	63	61	28	1
254	64	65	22	1
260	65	62	22	2
261	65	66	15	2
269	66	61	13	2
287	70	66	14	1
