Haberman Survival Dataset Analysis Using Python
Haberman Dataset Data Analysis and Visualization¶
About Haberman Dataset¶
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Haberman Dataset : https://www.kaggle.com/gilsousa/habermans-survival-data-set
Attribute Information:¶
- Age of patient at time of operation (numerical)
- Patient's year of operation (year - 1900, numerical)
- Number of positive axillary nodes detected (numerical)
- Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years
Research Question¶
Our objective is to find the attributes that affect survival status.
- Does age have an effect on survival status?
- Does the number of detected axillary nodes have an effect on post-op 5-year mortality?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Load haberman.csv into a pandas DataFrame.
#The file has no header row, so the column names are supplied while reading it
haberman = pd.read_csv("haberman.csv",names = ["PatientAge", "YearOfOperation",
"AuxNodes", "SurvivalStatus"])
## (Q) How many data points and features are there?
print('Number of rows and columns :' , haberman.shape)
## Print first 5 rows
haberman.head(5)
Data Cleaning¶
#Does any feature contain null values?
haberman.isnull().any()
#The data looks clean: there are no null values
Understand the data using descriptive statistics¶
The first step is to understand the features of the dataset.
Descriptive statistics describe and summarize data:
- Measures of central tendency
- Mean
- Mode
- Median
- Measures of dispersion (also called variability, scatter, or spread)
- Variance
- Standard deviation
- Range
- Interquartile range (IQR)
- Median absolute deviation (MAD)
- Shapes of Distributions
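To make the measures listed above concrete, here is a minimal sketch that computes each of them with Python's standard `statistics` module. The `values` list is invented for illustration and is not taken from haberman.csv; the quartiles use the "inclusive" method, which interpolates linearly between data points.

```python
import statistics

# Hypothetical node counts, made up for illustration (not from haberman.csv).
values = [0, 0, 0, 1, 2, 3, 4, 8, 13, 30]

# Measures of central tendency
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Measures of dispersion
value_range = max(values) - min(values)
variance = statistics.variance(values)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(values)       # square root of the sample variance

# Quartiles, interpolating linearly between neighbouring data points.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Median absolute deviation: median of the absolute distances from the median.
mad = statistics.median(abs(v - median) for v in values)

print("mean", mean, "median", median, "mode", mode)
print("range", value_range, "variance", variance, "std", std_dev)
print("Q1", q1, "Q3", q3, "IQR", iqr, "MAD", mad)
```

Note how the single large value pulls the mean well above the median, while the IQR and MAD are much less affected by it than the standard deviation is.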
#Generate descriptive statistics that summarize the
#central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values
haberman.describe()
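#find median
print('Median')
print(haberman.median())
#find MAD (median absolute deviation)
#DataFrame.mad() computed the mean absolute deviation and was removed in pandas 2.0,
#so compute the median absolute deviation explicitly
print('MAD')
print((haberman - haberman.median()).abs().median())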
# Q. How many people survived >=5 years
print(haberman['SurvivalStatus'].value_counts())
# Q. How many people survived >= 5 years, as a percentage
haberman['SurvivalStatus'].value_counts(normalize=True)
#We can see that 73% of the patients survived 5 years or longer
Let's divide the data into two parts based on survival status¶
df_survived = haberman[haberman['SurvivalStatus'] == 1]
df_died = haberman[haberman['SurvivalStatus'] == 2]
print('The patient survived >= 5 years')
print(df_survived.describe())
print('The patient died within 5 years')
print(df_died.describe())
Points to observe¶
The figures below are for AuxNodes.
Patients who survived 5 years or longer
* Mean = 2.79
* 75% = 3.0
Patients who died within 5 years
* Mean = 7.45
* 75% = 11.0
The standard deviation is also larger in the second group.
The 75% value, or third quartile (Q3), is the value below which 75% of the observations fall.
- Patients who died within 5 years had more positive axillary nodes on average.
- There is not much difference in PatientAge between the two groups.
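The per-group comparison above can also be produced in one step with `groupby`. The sketch below is only an illustration: the `toy` frame reuses the article's column names, but its rows are invented and are not from haberman.csv.

```python
import pandas as pd

# Hypothetical rows using the article's column names; the values are invented.
toy = pd.DataFrame({
    "PatientAge":     [34, 45, 52, 61, 38, 47, 55, 66],
    "AuxNodes":       [0, 1, 0, 3, 9, 4, 12, 7],
    "SurvivalStatus": [1, 1, 1, 1, 2, 2, 2, 2],
})

# Mean, median and standard deviation of each feature, per survival group.
summary = toy.groupby("SurvivalStatus")[["PatientAge", "AuxNodes"]].agg(
    ["mean", "median", "std"]
)

# Third quartile (Q3) of AuxNodes per group: 75% of each group's values
# fall at or below this number.
q3_by_group = toy.groupby("SurvivalStatus")["AuxNodes"].quantile(0.75)

print(summary)
print(q3_by_group)
```

Putting both groups in one table makes it easier to compare the same statistic side by side than reading two separate `describe()` outputs.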
Data Visualization¶
Let's analyse the data using graphs
#1. Plot all features
haberman.plot()
plt.show()
#A line plot of every column against the row index is hard to interpret.
# histograms
haberman.hist()
plt.show()
# box plot useful to identify outliers
haberman.plot(kind='box')
plt.show()
#We can see some outliers: the points drawn beyond the whiskers
#Let's analyse the data using a pairplot with respect to SurvivalStatus
#(seaborn 0.9 renamed the `size` argument to `height`)
sns.pairplot(haberman, hue="SurvivalStatus", height=3, diag_kind="kde")
plt.show()
Points to Observe¶
- PatientAge is more or less the same in both groups.
- AuxNodes looks different. We already know that the patients who died within 5 years had more positive axillary nodes.
#Seaborn plot of the AuxNodes PDF (`height` replaces the older `size` argument).
sns.FacetGrid(haberman, hue="SurvivalStatus", height=10) \
.map(sns.kdeplot, "AuxNodes") \
.add_legend();
plt.show();
Points to observe¶
You can observe a long tail on the right side for both groups. Let's see the same graph after removing outliers.
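A long right tail can also be checked numerically rather than only by eye. As a rough sketch, on invented values (not from haberman.csv), a right-skewed sample has a mean above its median and a positive sample skewness:

```python
import pandas as pd

# Hypothetical node counts with one large value, invented for illustration.
nodes = pd.Series([0, 0, 0, 1, 1, 2, 3, 5, 9, 28], name="AuxNodes")

mean_value = nodes.mean()
median_value = nodes.median()
skewness = nodes.skew()   # sample skewness; positive means a longer right tail

print("mean:", mean_value)
print("median:", median_value)
print("skewness:", skewness)
```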
Remove the outliers from AuxNodes and draw again.¶
#box plot
sns.boxplot(x='SurvivalStatus',y='AuxNodes', data=haberman)
plt.show()
#find IQR = Q3 - Q1
#print('Q3 \n' ,haberman.quantile(0.75))
#print('Q1 \n ' ,haberman.quantile(0.25))
df_iqr = haberman.quantile(0.75) - haberman.quantile(0.25)
print('IQR \n',df_iqr)
aux_low_iqr = haberman['AuxNodes'].quantile(0.25) - (df_iqr['AuxNodes'] * 1.5)
aux_high_iqr = haberman['AuxNodes'].quantile(0.75) + (df_iqr['AuxNodes'] * 1.5)
print('Outlier < Q1 - 1.5*IQR \n ',aux_low_iqr )
print('Outlier > Q3 + 1.5*IQR \n ', aux_high_iqr)
#Keep the rows whose AuxNodes value lies within the fences (inclusive),
#so that this frame and df_removed below together cover every row
df_haberman_cleaned = haberman[(haberman['AuxNodes'] >= aux_low_iqr) & \
(haberman['AuxNodes'] <= aux_high_iqr)]
print('After removing outliers',df_haberman_cleaned.shape)
#Outliers in Auxnodes
df_removed = haberman[(haberman['AuxNodes'] < aux_low_iqr) | \
(haberman['AuxNodes'] > aux_high_iqr)]
print('Removed data shape',df_removed.shape)
#we have removed 40 rows from the dataset.
df_removed
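The fence calculation and the two filters above can be wrapped in small reusable helpers. This is a minimal sketch, not the article's original code: `iqr_fences` and `split_outliers` are hypothetical names, and the `toy` frame is invented so that one value sits exactly on the upper fence.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(df: pd.DataFrame, column: str, k: float = 1.5):
    """Split df into (kept, removed) using the IQR fences of one column."""
    low, high = iqr_fences(df[column], k)
    inside = df[column].between(low, high)   # inclusive on both ends
    return df[inside], df[~inside]

# Hypothetical data, invented for illustration: Q1 = 0, Q3 = 4, IQR = 4,
# so the fences are -6 and 10, and the value 10 lies exactly on the upper fence.
toy = pd.DataFrame({"AuxNodes": [0, 0, 0, 1, 2, 3, 4, 10, 25]})
kept, removed = split_outliers(toy, "AuxNodes")

print("fences:", iqr_fences(toy["AuxNodes"]))
print("kept rows:", len(kept), "removed rows:", len(removed))
```

Building the removed set as the complement of a single boolean mask guarantees that every row ends up in exactly one of the two frames, including values that fall exactly on a fence.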
#Seaborn plot of the AuxNodes PDF after removing outliers.
sns.FacetGrid(df_haberman_cleaned, hue="SurvivalStatus", height=10) \
.map(sns.kdeplot, "AuxNodes") \
.add_legend();
plt.show();
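With the outliers removed, the two curves are easier to compare. Each density curve encloses a total area of 1, so what differs is where that area sits: the green curve has more of its area at higher node counts than the blue one.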
Conclusion¶
- Does age have an effect on survival status? No
- Does the number of detected axillary nodes have an effect on post-op 5-year mortality? Yes
Next Objective¶
The analysis above suggests that survival status is associated with the number of detected axillary nodes.
But we are more interested in a question like the one below.
Given the number of detected axillary nodes and past results, can we predict the survival status?
We will see how to answer this question in an upcoming article.