Haberman Survival Dataset Analysis Using Python

Haberman Dataset Data Analysis and Visualization

About Haberman Dataset

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Haberman Dataset : https://www.kaggle.com/gilsousa/habermans-survival-data-set

Attribute Information:

  1. Age of patient at time of operation (numerical)
  2. Patient's year of operation (year - 1900, numerical)
  3. Number of positive axillary nodes detected (numerical)
  4. Survival status (class attribute)
       1 = the patient survived 5 years or longer
       2 = the patient died within 5 years

Research Question

Our objective is to find the attributes that affect survival status.

  1. Does age have an effect on survival status?
  2. Does the number of detected positive axillary nodes have an effect on post-op 5-year mortality?
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

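#Load haberman.csv into a pandas DataFrame.
#The dataset has no header row, so the column names are supplied while reading the file.
#Note: the AuxNodes column holds the number of positive axillary nodes.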
haberman = pd.read_csv("haberman.csv",names = ["PatientAge", "YearOfOperation", 
                                               "AuxNodes", "SurvivalStatus"])
In [2]:
## (Q) How many data points and features are there?
print('Number of rows and columns :' , haberman.shape)

## Print first 5 rows
haberman.head(5)
Number of rows and columns : (306, 4)
Out[2]:
   PatientAge  YearOfOperation  AuxNodes  SurvivalStatus
0          30               64         1               1
1          30               62         3               1
2          30               65         0               1
3          31               59         2               1
4          31               65         4               1

Data Cleaning

In [3]:
#Does any feature contain null values?
haberman.isnull().any()

#The data looks clean: there are no null values.
Out[3]:
PatientAge         False
YearOfOperation    False
AuxNodes           False
SurvivalStatus     False
dtype: bool

Find Outliers

  • IQR = Q3 - Q1
  • Outlier < Q1 - 1.5*IQR
  • Outlier > Q3 + 1.5*IQR

One of the steps in data cleaning is to remove outliers. To keep things simple, we skip this step for now. At the end, we will see how to remove outliers.

Understand the data using descriptive statistics

The first step is to understand the features of the dataset.

Descriptive statistics describe and summarize data:

  • Measures of central tendency
    • Mean
    • Mode
    • Median
  • Measures of dispersion (also called variability, scatter, or spread)
    • Variance
    • Standard deviation
    • Range
    • Interquartile range (IQR)
    • Median absolute deviation (MAD)
  • Shapes of Distributions
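
As a rough illustration of the measures listed above, the sketch below computes each of them with Python's standard `statistics` module. The `nodes` list is made-up illustrative data, not the Haberman dataset, and the variable names exist only for this example.

```python
import statistics

# Illustrative node counts only; this is NOT the Haberman data.
nodes = [0, 0, 0, 1, 2, 3, 4, 13, 31]

# Measures of central tendency
mean = statistics.mean(nodes)
median = statistics.median(nodes)
mode = statistics.mode(nodes)

# Measures of dispersion
variance = statistics.variance(nodes)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(nodes)       # sample standard deviation
value_range = max(nodes) - min(nodes)

# Quartiles with linear interpolation between data points
q1, q2, q3 = statistics.quantiles(nodes, n=4, method="inclusive")
iqr = q3 - q1

# Median absolute deviation: median distance from the median
mad = statistics.median([abs(x - median) for x in nodes])

# Shape: a mean well above the median hints at a long right tail
print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", round(variance, 2), "std:", round(std_dev, 2))
print("range:", value_range, "IQR:", iqr, "MAD:", mad)
```

The median and the median absolute deviation barely react to the two large values at the end of the list, whereas the mean and the standard deviation are pulled upward by them.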
In [4]:
#Generate descriptive statistics that summarize the 
#central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values
haberman.describe()
Out[4]:
       PatientAge  YearOfOperation    AuxNodes  SurvivalStatus
count  306.000000       306.000000  306.000000      306.000000
mean    52.457516        62.852941    4.026144        1.264706
std     10.803452         3.249405    7.189654        0.441899
min     30.000000        58.000000    0.000000        1.000000
25%     44.000000        60.000000    0.000000        1.000000
50%     52.000000        63.000000    1.000000        1.000000
75%     60.750000        65.750000    4.000000        2.000000
max     83.000000        69.000000   52.000000        2.000000
In [5]:
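#find median
print('Median')
print(haberman.median())

#find MAD. DataFrame.mad() computed the mean absolute deviation (not the median
#absolute deviation) and was removed in pandas 2.0, so it is computed directly here.
print ('MAD')
print((haberman - haberman.mean()).abs().mean())
Median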
PatientAge         52.0
YearOfOperation    63.0
AuxNodes            1.0
SurvivalStatus      1.0
dtype: float64
MAD
PatientAge         8.865180
YearOfOperation    2.787005
AuxNodes           4.790935
SurvivalStatus     0.389273
dtype: float64
In [6]:
# Q. How many people survived >= 5 years?
print(haberman['SurvivalStatus'].value_counts())

# Q. What percentage of people survived >= 5 years?
haberman['SurvivalStatus'].value_counts(normalize=True)

#We can see that about 73.5% of the patients survived 5 years or longer
1    225
2     81
Name: SurvivalStatus, dtype: int64
Out[6]:
1    0.735294
2    0.264706
Name: SurvivalStatus, dtype: float64

Let's divide the data into two parts based on survival status.

In [7]:
df_survived = haberman[haberman['SurvivalStatus'] == 1]
df_died = haberman[haberman['SurvivalStatus'] == 2]

print('The patient survived >= 5 years') 
print(df_survived.describe())

print('The patient died within 5 years') 
print(df_died.describe())
The patient survived >= 5 years
       PatientAge  YearOfOperation    AuxNodes  SurvivalStatus
count  225.000000       225.000000  225.000000           225.0
mean    52.017778        62.862222    2.791111             1.0
std     11.012154         3.222915    5.870318             0.0
min     30.000000        58.000000    0.000000             1.0
25%     43.000000        60.000000    0.000000             1.0
50%     52.000000        63.000000    0.000000             1.0
75%     60.000000        66.000000    3.000000             1.0
max     77.000000        69.000000   46.000000             1.0
The patient died within 5 years
       PatientAge  YearOfOperation   AuxNodes  SurvivalStatus
count   81.000000        81.000000  81.000000            81.0
mean    53.679012        62.827160   7.456790             2.0
std     10.167137         3.342118   9.185654             0.0
min     34.000000        58.000000   0.000000             2.0
25%     46.000000        59.000000   1.000000             2.0
50%     53.000000        63.000000   4.000000             2.0
75%     61.000000        65.000000  11.000000             2.0
max     83.000000        69.000000  52.000000             2.0

Points to observe

AuxNodes for patients who survived 5 years or longer

* Mean = 2.79
* 75%  = 3.0

AuxNodes for patients who died within 5 years

* Mean = 7.46
* 75%  = 11.0

The standard deviation of AuxNodes is also higher in the second group (9.19 versus 5.87).

The 75% value, or third quartile (Q3), is the middle value between the median and the highest value of the data set, so 75% of the observations lie at or below it.

  • Patients who died within 5 years tend to have more positive axillary nodes.
  • There is not much difference in PatientAge between the two groups.
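
The same per-group comparison can also be produced in one step with `groupby`, instead of splitting the DataFrame by hand. This is only a sketch: the `sample` frame holds a handful of rows picked for illustration, so its numbers are not representative of all 306 patients, and the names `sample`, `summary`, and `q3_nodes` are invented for this example.

```python
import pandas as pd

# A few rows for illustration only (NOT representative of the full dataset).
sample = pd.DataFrame(
    {
        "PatientAge": [30, 30, 31, 34, 38, 41, 43, 48],
        "AuxNodes": [1, 3, 2, 30, 21, 23, 52, 11],
        "SurvivalStatus": [1, 1, 1, 1, 2, 2, 2, 2],
    }
)

# One row per survival class; mean, median and std for each feature.
summary = sample.groupby("SurvivalStatus")[["PatientAge", "AuxNodes"]].agg(
    ["mean", "median", "std"]
)
print(summary)

# Third quartile (Q3) of AuxNodes for each survival class.
q3_nodes = sample.groupby("SurvivalStatus")["AuxNodes"].quantile(0.75)
print(q3_nodes)
```

Applied to the full `haberman` DataFrame, the same two calls should reproduce the per-group means and 75% values shown in the `describe()` output above.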

Data Visualization

Let's analyze the data using graphs.

In [8]:
#1. Plot all features
haberman.plot()
plt.show()
#We cannot make much sense out of this plot.
In [9]:
# histograms
haberman.hist()
plt.show()
In [10]:
# A box plot is useful for identifying outliers
haberman.plot(kind='box')
plt.show()

#We can see some outliers: the points beyond the whiskers.
In [11]:
#Let's analyze the data using a pairplot with respect to SurvivalStatus
#(seaborn 0.9 renamed the `size` argument to `height`)
sns.pairplot(haberman, hue="SurvivalStatus", height=3, diag_kind="kde")
plt.show()
Points to Observe
  • PatientAge is more or less the same in both groups.
  • AuxNodes looks different. We already know that the patients who died within 5 years have more positive axillary nodes.
In [12]:
#Seaborn plot of the AuxNodes PDF (kernel density estimate).
#(seaborn 0.9 renamed the `size` argument to `height`)
sns.FacetGrid(haberman, hue="SurvivalStatus", height=10) \
   .map(sns.kdeplot, "AuxNodes") \
   .add_legend()
plt.show()

Points to observe

You can observe a long tail on the right side for both groups. Let's look at the same graph after removing the outliers.
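
A long right tail can also be checked numerically: the mean sits above the median and the skewness is positive. The sketch below shows this on a few illustrative node counts (the `nodes` series is invented for this example and is not the full dataset).

```python
import pandas as pd

# Illustrative node counts only (NOT the full dataset).
nodes = pd.Series([1, 3, 0, 2, 4, 30, 13, 15], name="AuxNodes")

skewness = nodes.skew()  # a positive value indicates a longer right tail
print("mean:", nodes.mean())
print("median:", nodes.median())
print("skewness:", round(skewness, 2))
```

On the real data, `haberman.groupby("SurvivalStatus")["AuxNodes"].skew()` would give the same check for each survival class.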

Remove the outliers from AuxNodes and draw again.
In [13]:
#box plot 
sns.boxplot(x='SurvivalStatus',y='AuxNodes', data=haberman)
plt.show()
In [14]:
#find IQR = Q3 - Q1
#print('Q3 \n' ,haberman.quantile(0.75))
#print('Q1 \n ' ,haberman.quantile(0.25))

df_iqr = haberman.quantile(0.75) - haberman.quantile(0.25)
print('IQR \n',df_iqr)

aux_low_iqr = haberman['AuxNodes'].quantile(0.25) - (df_iqr['AuxNodes'] * 1.5)
aux_high_iqr = haberman['AuxNodes'].quantile(0.75) + (df_iqr['AuxNodes'] * 1.5)

print('Outlier < Q1 - 1.5*IQR \n ',aux_low_iqr )
print('Outlier > Q3 + 1.5*IQR \n ', aux_high_iqr)


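#Keep only the rows that lie strictly inside the fences.
#Note: the strict inequalities also drop rows whose AuxNodes value equals a fence
#(here the 3 rows with AuxNodes == 10), which is why 263 + 40 = 303 rather than 306.
#Use >= and <= instead to keep those rows.
df_haberman_cleaned = haberman[(haberman['AuxNodes'] > aux_low_iqr) & \
                               (haberman['AuxNodes'] < aux_high_iqr)]
print('After removing outliers',df_haberman_cleaned.shape)

#Outliers in AuxNodes
df_removed = haberman[(haberman['AuxNodes'] < aux_low_iqr) | \
                      (haberman['AuxNodes'] > aux_high_iqr)]
print('Removed data shape',df_removed.shape)

#40 rows fall outside the fences.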
df_removed
IQR 
 PatientAge         16.75
YearOfOperation     5.75
AuxNodes            4.00
SurvivalStatus      1.00
dtype: float64
Outlier < Q1 - 1.5*IQR 
  -6.0
Outlier > Q3 + 1.5*IQR 
  10.0
After removing outliers (263, 4)
Removed data shape (40, 4)
Out[14]:
PatientAge YearOfOperation AuxNodes SurvivalStatus
9 34 58 30 1
14 35 64 13 1
22 37 60 15 1
24 38 69 21 2
31 38 66 11 1
43 41 60 23 2
59 42 62 20 1
62 43 58 52 2
66 43 63 14 1
75 44 63 19 2
79 44 67 16 1
85 45 59 14 1
92 46 65 20 2
96 47 63 23 2
106 47 66 12 1
107 48 58 11 2
108 48 58 11 2
124 50 63 13 2
136 51 59 13 2
160 53 63 24 2
161 53 65 12 2
167 54 60 11 2
168 54 65 23 2
174 54 67 46 1
177 54 63 19 1
181 55 68 15 2
185 55 66 18 1
188 55 69 22 1
198 57 62 14 2
215 59 62 35 2
223 60 59 17 2
227 60 61 25 1
238 62 59 13 2
240 62 65 19 2
252 63 61 28 1
254 64 65 22 1
260 65 62 22 2
261 65 66 15 2
269 66 61 13 2
287 70 66 14 1
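
The filtering step above can also be wrapped in a small reusable helper. The sketch below is one possible version, not the notebook's own code: `iqr_fences` and `split_outliers` are hypothetical names, and the `toy` values are illustrative. Unlike the cell above, which uses strict comparisons, it keeps values that lie exactly on a fence, so the two returned parts always add up to the original number of rows.

```python
import pandas as pd


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(df, column, k=1.5):
    """Split df into (kept, outliers); values exactly on a fence are kept."""
    low, high = iqr_fences(df[column], k)
    is_inlier = df[column].between(low, high)  # inclusive on both ends
    return df[is_inlier], df[~is_inlier]


# Illustrative values only, chosen so that Q1 = 0 and Q3 = 4.
toy = pd.DataFrame({"AuxNodes": [0, 0, 0, 1, 2, 3, 4, 10, 31]})
kept, outliers = split_outliers(toy, "AuxNodes")

print("fences:", iqr_fences(toy["AuxNodes"]))
print("kept:", len(kept), "outliers:", len(outliers))
```

Whether a value sitting exactly on the fence should count as an outlier is a judgment call; the rule stated earlier (`Outlier > Q3 + 1.5*IQR`) treats it as a regular point, which is what this helper does.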
In [15]:
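#Seaborn plot of the AuxNodes PDF (kernel density estimate), outliers removed.
#(seaborn 0.9 renamed the `size` argument to `height`)
sns.FacetGrid(df_haberman_cleaned, hue="SurvivalStatus", height=10) \
   .map(sns.kdeplot, "AuxNodes") \
   .add_legend()
plt.show()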

With the outliers removed, the graph is easier to read, and you can see that the area under the green line is larger than the area under the blue line.

Conclusion

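  1. Does age have an effect on survival status? No: the mean age is nearly the same in both groups (52.0 versus 53.7).
  2. Does the number of detected positive axillary nodes have an effect on post-op 5-year mortality? Yes: patients who died within 5 years had more positive nodes on average (7.46 versus 2.79).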
Next Objective

We have seen that survival status is associated with the number of detected positive axillary nodes.

But we are more interested in a question like the one below.

Given the number of detected positive axillary nodes and past results, can we predict the survival status?

We will see how to answer this question in an upcoming article.
