Titanic data: correlation analysis

In [55]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the data prepared in the earlier EDA step from pickle files.

In [56]:
total_set = pd.read_pickle("total_set.pickle")
total_set.head()
Out[56]:
   PassengerId  Pclass  Sex  SibSp  Parch  Embarked  name_code  Age_value  Fare_value
0            1       3    0      1      0         0        0.0          4           0
1            2       1    1      1      0         1        2.0          7           3
2            3       3    1      0      0         0        1.0          5           1
3            4       1    1      1      0         0        2.0          6           3
4            5       3    0      0      0         0        0.0          6           1
In [57]:
train_data_pre = pd.read_pickle("train_data.pickle")
train_data_pre.head()
Out[57]:
   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch  Embarked  name_code  Fare_value
0            1         0       3    0  22.0      1      0         0        0.0           0
1            2         1       1    1  38.0      1      0         1        2.0           3
2            3         1       3    1  26.0      0      0         0        1.0           1
3            4         1       1    1  35.0      1      0         0        2.0           3
4            5         0       3    0  35.0      0      0         0        0.0           1

Split total_set into train_data and test_data

train_data set

In [58]:
train_data = total_set.iloc[:891, :]
train_data.tail()
Out[58]:
     PassengerId  Pclass  Sex  SibSp  Parch  Embarked  name_code  Age_value  Fare_value
886          887       2    0      0      0         0        0.0          5           1
887          888       1    1      0      0         0        1.0          3           2
888          889       3    1      1      2         0        1.0          4           2
889          890       1    0      0      0         1        0.0          5           2
890          891       3    0      0      0         2        0.0          6           0
In [59]:
train_data.loc[ : , 'Survived'] = train_data_pre.loc[ : , 'Survived']
train_data.tail()
C:\Users\hyejin\Anaconda2\envs\py36\lib\site-packages\pandas\core\indexing.py:337: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
C:\Users\hyejin\Anaconda2\envs\py36\lib\site-packages\pandas\core\indexing.py:517: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
Out[59]:
     PassengerId  Pclass  Sex  SibSp  Parch  Embarked  name_code  Age_value  Fare_value  Survived
886          887       2    0      0      0         0        0.0          5           1         0
887          888       1    1      0      0         0        1.0          3           2         1
888          889       3    1      1      2         0        1.0          4           2         0
889          890       1    0      0      0         1        0.0          5           2         1
890          891       3    0      0      0         2        0.0          6           0         0
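The SettingWithCopyWarning printed above typically appears because train_data was created as an iloc slice of total_set and then modified. A minimal sketch of one common way to avoid it is to take an explicit copy before adding columns. The two small frames below are made-up stand-ins for total_set and train_data_pre, not the real data.

```python
import pandas as pd

# Hypothetical miniature stand-ins for total_set and train_data_pre.
total_set = pd.DataFrame({"PassengerId": [1, 2, 3, 4], "Pclass": [3, 1, 3, 1]})
train_data_pre = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})

# .copy() makes train_data an independent frame rather than a slice of total_set,
# so later assignments are unambiguous and total_set is left untouched.
train_data = total_set.iloc[:3, :].copy()

# Column assignment aligns on the row index (0, 1, 2 here).
train_data["Survived"] = train_data_pre["Survived"]

print(train_data)
```

The same idea would apply to test_data, which is also sliced out of total_set.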
  • test data set
In [60]:
test_data = total_set.iloc[891:, :]
test_data.tail()
Out[60]:
     PassengerId  Pclass  Sex  SibSp  Parch  Embarked  name_code  Age_value  Fare_value
413         1305       3    0      0      0         0        0.0          5           1
414         1306       1    1      0      0         1        4.0          7           3
415         1307       3    0      0      0         0        0.0          7           0
416         1308       3    0      0      0         0        0.0          5           1
417         1309       3    0      1      1         1        3.0          5           2

The PassengerId column of train_data is just a row index, so it is dropped.
The PassengerId column of test_data is needed when submitting results to Kaggle, so it is kept.
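To illustrate why PassengerId stays in test_data: the Kaggle Titanic submission file pairs each PassengerId with a predicted Survived value. The sketch below assumes a list of predictions already exists; the values here are placeholders, not output from any model in this post.

```python
import pandas as pd

# Hypothetical stand-ins: three test rows and made-up placeholder predictions.
test_data = pd.DataFrame({"PassengerId": [892, 893, 894], "Pclass": [3, 3, 2]})
predictions = [0, 1, 0]  # would come from a fitted model; invented for illustration

# The submission needs exactly two columns: PassengerId and Survived.
submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions,
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```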

In [61]:
train_data.drop('PassengerId', axis=1, inplace=True )

Look at the correlations in the train data. Features whose correlation with Survived comes out around 0.0x (close to zero) are the ones I will try transforming.

In [62]:
train_data.corr()
Out[62]:
               Pclass        Sex      SibSp      Parch   Embarked  name_code  Age_value  Fare_value   Survived
Pclass       1.000000  -0.131900   0.083081   0.018443   0.045702  -0.105629  -0.400822   -0.644180  -0.338481
Sex         -0.131900   1.000000   0.114631   0.245489   0.116569   0.618390  -0.122521    0.239812   0.543351
SibSp        0.083081   0.114631   1.000000   0.414838  -0.059961   0.312732  -0.250593    0.378720  -0.035322
Parch        0.018443   0.245489   0.414838   1.000000  -0.078665   0.388093  -0.185674    0.374659   0.081629
Embarked     0.045702   0.116569  -0.059961  -0.078665   1.000000   0.036684  -0.037095   -0.094764   0.106811
name_code   -0.105629   0.618390   0.312732   0.388093   0.036684   1.000000  -0.227037    0.337250   0.466655
Age_value   -0.400822  -0.122521  -0.250593  -0.185674  -0.037095  -0.227037   1.000000    0.123883  -0.073381
Fare_value  -0.644180   0.239812   0.378720   0.374659  -0.094764   0.337250   0.123883    1.000000   0.306855
Survived    -0.338481   0.543351  -0.035322   0.081629   0.106811   0.466655  -0.073381    0.306855   1.000000
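The weakly correlated features can also be picked out programmatically instead of by eye. The sketch below copies the Survived row from Out[62] into a Series and keeps the entries whose absolute value is under 0.1. The 0.1 cutoff is my assumption for what "around 0.0x" means, not a threshold from the original notebook.

```python
import pandas as pd

# Correlations with Survived, copied from the Out[62] table.
# In the notebook this would be: train_data.corr()["Survived"].drop("Survived")
survived_corr = pd.Series({
    "Pclass": -0.338481,
    "Sex": 0.543351,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Embarked": 0.106811,
    "name_code": 0.466655,
    "Age_value": -0.073381,
    "Fare_value": 0.306855,
})

# Assumed cutoff: treat |r| < 0.1 as "close to zero".
weak = survived_corr[survived_corr.abs() < 0.1]
weak_features = list(weak.index)
print(weak_features)  # ['SibSp', 'Parch', 'Age_value']
```

These three are the columns that get reworked in the rest of the post.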
In [63]:
sns.heatmap(train_data.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)
fig = plt.gcf()
fig.set_size_inches(10,7)
plt.show()
  • Create the Alone and BigFamily columns

SibSp and Parch are summed to create a Family_size column.

Alone: 1 if Family_size is 0, otherwise 0.

BigFamily: 1 if SibSp or Parch is greater than 3 (the code below uses a strict > 3 comparison), otherwise 0.
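As a compact sketch of these two rules, both flags can be computed as boolean expressions and cast to integers. The four rows below are invented to hit each case, and the BigFamily condition follows the > 3 comparison used in the notebook code.

```python
import pandas as pd

# Made-up rows covering: small family, alone, many siblings, many parents/children.
df = pd.DataFrame({"SibSp": [1, 0, 4, 0], "Parch": [0, 0, 1, 5]})

family_size = df["SibSp"] + df["Parch"]

# Alone: nobody else from the family on board.
df["Alone"] = (family_size == 0).astype(int)

# BigFamily: either count is strictly greater than 3.
df["BigFamily"] = ((df["SibSp"] > 3) | (df["Parch"] > 3)).astype(int)

print(df)
```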

In [64]:
train_data['Family_size'] = 0
train_data.loc[ : ,'Family_size'] = train_data['SibSp'] + train_data['Parch']
In [65]:
train_data.head()
Out[65]:
   Pclass  Sex  SibSp  Parch  Embarked  name_code  Age_value  Fare_value  Survived  Family_size
0       3    0      1      0         0        0.0          4           0         0            1
1       1    1      1      0         1        2.0          7           3         1            1
2       3    1      0      0         0        1.0          5           1         1            0
3       1    1      1      0         0        2.0          6           3         1            1
4       3    0      0      0         0        0.0          6           1         0            0
In [66]:
train_data['Alone'] = 0
train_data.loc[train_data['Family_size'] ==0 , 'Alone'] =1
In [67]:
train_data['BigFamily'] = 0
train_data.loc[ (train_data['SibSp'] >3 ) | (train_data['Parch'] > 3 ) , 'BigFamily'] = 1
In [68]:
train_data.drop(['SibSp', 'Parch'], axis=1, inplace=True)
In [69]:
train_data.drop(['Family_size'], axis=1, inplace=True)
In [70]:
#titanic_data.drop(['Sex'], axis=1, inplace=True)

Passengers travelling alone turn out to have a lower survival rate.

In [71]:
# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the point plot
sns.catplot(x='Alone', y='Survived', data=train_data, kind='point')
plt.show()
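The point plot shows the mean of Survived for each value of Alone, so the same numbers can be read off directly with a groupby. The frame below is toy data invented for the example and says nothing about the real Titanic rates. In the notebook the call would be made on train_data.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Alone":    [0, 0, 1, 1, 1, 1],
    "Survived": [1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group is the survival rate of that group.
rate_by_alone = toy.groupby("Alone")["Survived"].mean()
print(rate_by_alone)
```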

The more expensive the ticket, the higher the survival rate.

In [72]:
sns.catplot(x='Fare_value', y='Survived', data=train_data, kind='point')
plt.show()

Since the correlation between Age_value and Survived was low above, I try regrouping Age_value.

In [73]:
sns.catplot(x='Age_value', y='Survived', data=train_data, kind='point')
plt.show()

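The plot above shows a high survival rate for ages 0 to 5 (bin 0) and a low survival rate for ages 65 and over (bins 13, 14, 15). Bin 15 contains a single passenger.
The remaining age bins appear to have roughly similar survival rates.
So the bins are regrouped as 0 -> 0, 1~12 -> 1, 13~15 -> 2.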
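The regrouping rule can be written more compactly than a sixteen-entry dictionary. The sketch below shows two equivalent options on a few sample bin codes (not real rows): a mapping built with a comprehension, and pd.cut with explicit bin edges. Both are alternatives I am suggesting, not what the notebook itself uses.

```python
import pandas as pd

# Sample bin codes chosen to hit every group boundary.
age_value = pd.Series([0, 1, 6, 12, 13, 15])

# Option 1: build the 0..15 mapping programmatically, then map.
mapping = {code: 0 if code == 0 else (1 if code <= 12 else 2) for code in range(16)}
regrouped = age_value.map(mapping)

# Option 2: pd.cut with right-closed intervals (-1, 0], (0, 12], (12, 15].
regrouped_cut = pd.cut(age_value, bins=[-1, 0, 12, 15], labels=[0, 1, 2]).astype(int)

print(regrouped.tolist())      # [0, 1, 1, 1, 2, 2]
print(regrouped_cut.tolist())  # [0, 1, 1, 1, 2, 2]
```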

In [74]:
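# Assign the result back to the column instead of calling replace(..., inplace=True)
# on the attribute, which newer pandas versions may not propagate to train_data.
train_data['Age_value'] = train_data['Age_value'].replace(
    {0:0, 1:1, 2:1, 3:1, 4:1, 5:1,
     6:1, 7:1, 8:1, 9:1, 10:1, 11:1,
     12:1, 13:2, 14:2, 15:2}
)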
In [75]:
sns.catplot(x='Age_value', y='Survived', data=train_data, kind='point')
plt.show()

Look at the correlations again.

In [76]:
sns.heatmap(train_data.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)
fig = plt.gcf()
fig.set_size_inches(10,7)
plt.show()
In [77]:
train_data.head()
Out[77]:
   Pclass  Sex  Embarked  name_code  Age_value  Fare_value  Survived  Alone  BigFamily
0       3    0         0        0.0          1           0         0      0          0
1       1    1         1        2.0          1           3         1      0          0
2       3    1         0        1.0          1           1         1      1          0
3       1    1         0        2.0          1           3         1      0          0
4       3    0         0        0.0          1           1         0      1          0

Save the fully processed train_data as a pickle.

In [88]:
train_data.to_pickle('final_train')

test_data set

In [78]:
#titanic_data_test = pd.read_pickle("titanic_data_test.pickle")
test_data.head()
Out[78]:
   PassengerId  Pclass  Sex  SibSp  Parch  Embarked  name_code  Age_value  Fare_value
0          892       3    0      0      0         2        0.0          6           0
1          893       3    1      1      0         0        2.0          9           0
2          894       2    0      0      0         2        0.0         12           1
3          895       3    0      0      0         0        0.0          5           1
4          896       3    1      1      1         0        2.0          4           1

Process it in exactly the same way as the train data.

In [79]:
test_data['Family_size'] = test_data['SibSp'] + test_data['Parch']
In [80]:
test_data['Alone'] = 0
test_data.loc[test_data['Family_size'] ==0 , 'Alone'] =1
In [81]:
test_data['BigFamily'] = 0
test_data.loc[ (test_data['SibSp'] >3 ) | (test_data['Parch'] > 3 ) , 'BigFamily'] =1
In [82]:
test_data.drop(['SibSp', 'Parch'], axis=1, inplace=True)
In [83]:
test_data.drop(['Family_size'], axis=1, inplace=True)
In [521]:
#titanic_data_test.drop(['Sex'], axis=1, inplace=True)
In [84]:
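# Assign the result back to the column instead of calling replace(..., inplace=True)
# on the attribute, which newer pandas versions may not propagate to test_data.
test_data['Age_value'] = test_data['Age_value'].replace(
    {0:0, 1:1, 2:1, 3:1, 4:1, 5:1,
     6:1, 7:1, 8:1, 9:1, 10:1, 11:1,
     12:1, 13:2, 14:2, 15:2}
)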
In [85]:
test_data.head()
Out[85]:
   PassengerId  Pclass  Sex  Embarked  name_code  Age_value  Fare_value  Alone  BigFamily
0          892       3    0         2        0.0          1           0      1          0
1          893       3    1         0        2.0          1           0      0          0
2          894       2    0         2        0.0          1           1      1          0
3          895       3    0         0        0.0          1           1      1          0
4          896       3    1         0        2.0          1           1      0          0
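Because the test data is supposed to go through the same processing as the train data, a quick sanity check is to compare the feature columns of the two frames. The sketch below uses the column names shown in Out[77] and Out[85]. In the notebook they could be taken from list(train_data.columns) and list(test_data.columns) instead.

```python
# Column names as shown in Out[77] (train) and Out[85] (test).
train_columns = ["Pclass", "Sex", "Embarked", "name_code", "Age_value",
                 "Fare_value", "Survived", "Alone", "BigFamily"]
test_columns = ["PassengerId", "Pclass", "Sex", "Embarked", "name_code",
                "Age_value", "Fare_value", "Alone", "BigFamily"]

# Survived is the label (train only); PassengerId is kept only for submission.
train_features = [c for c in train_columns if c != "Survived"]
test_features = [c for c in test_columns if c != "PassengerId"]

features_match = train_features == test_features
print(features_match)  # True
```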
In [87]:
test_data.to_pickle('final_test')
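As a small sketch of how the saved files can be checked, a frame written with to_pickle can be read back with read_pickle and compared against the original. The frame and the file name below are placeholders written to a temporary directory, so nothing in the working folder is touched.

```python
import os
import tempfile

import pandas as pd

# Placeholder frame standing in for the processed test data.
frame = pd.DataFrame({"PassengerId": [892, 893], "Alone": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path works for to_pickle / read_pickle.
    path = os.path.join(tmp_dir, "final_test.pickle")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() checks values, dtypes and index together.
round_trip_ok = restored.equals(frame)
print(round_trip_ok)  # True
```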