Kaggle Titanic: correlation analysis of the preprocessed data

느린 개미 2018. 12. 21. 18:33

Titanic data correlation analysis

In [55]:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the data prepared in the earlier EDA post from pickle files

In [56]:

total_set = pd.read_pickle("total_set.pickle")
total_set.head()

Out[56]:

	PassengerId	Pclass	Sex	SibSp	Embarked	name_code	Age_value	Fare_value
0	1	3	0	1	0	0.0	4	0
1	2	1	1	1	1	2.0	7	3
2	3	3	1	0	0	1.0	5	1
3	4	1	1	1	0	2.0	6	3
4	5	3	0	0	0	0.0	6	1

In [57]:

train_data_pre = pd.read_pickle("train_data.pickle")
train_data_pre.head()

Out[57]:

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Embarked	name_code	Fare_value
0	1	0	3	0	22.0	1	0	0.0	0
1	2	1	1	1	38.0	1	1	2.0	3
2	3	1	3	1	26.0	0	0	1.0	1
3	4	1	1	1	35.0	1	0	2.0	3
4	5	0	3	0	35.0	0	0	0.0	1

Split total_set into train_data and test_data

train_data set

In [58]:

train_data = total_set.iloc[:891, :]
train_data.tail()

Out[58]:

	PassengerId	Pclass	Sex	SibSp	Parch	Embarked	name_code	Age_value	Fare_value
886	887	2	0	0	0	0	0.0	5	1
887	888	1	1	0	0	0	1.0	3	2
888	889	3	1	1	2	0	1.0	4	2
889	890	1	0	0	0	1	0.0	5	2
890	891	3	0	0	0	2	0.0	6	0

In [59]:

train_data.loc[ : , 'Survived'] = train_data_pre.loc[ : , 'Survived']
train_data.tail()

C:\Users\hyejin\Anaconda2\envs\py36\lib\site-packages\pandas\core\indexing.py:337: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
C:\Users\hyejin\Anaconda2\envs\py36\lib\site-packages\pandas\core\indexing.py:517: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Out[59]:

	PassengerId	Pclass	Sex	SibSp	Parch	Embarked	name_code	Age_value	Fare_value	Survived
886	887	2	0	0	0	0	0.0	5	1	0
887	888	1	1	0	0	0	1.0	3	2	1
888	889	3	1	1	2	0	1.0	4	2	0
889	890	1	0	0	0	1	0.0	5	2	1
890	891	3	0	0	0	2	0.0	6	0	0
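
The SettingWithCopyWarning above appears because train_data is a slice of total_set. One common way to avoid it is to take an explicit copy when slicing. A minimal sketch, assuming a small toy frame in place of the real total_set (the values below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for total_set and the Survived column (hypothetical values).
total_set = pd.DataFrame({"PassengerId": [1, 2, 3, 4], "Pclass": [3, 1, 3, 1]})
survived = pd.Series([0, 1, 1, 1], name="Survived")

# .copy() makes train_data an independent DataFrame, so adding a column
# to it is unambiguous and leaves total_set untouched.
train_data = total_set.iloc[:3, :].copy()
train_data["Survived"] = survived.iloc[:3]

print(train_data)
```

The same `.copy()` on the `iloc[891:, :]` slice would apply to test_data.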

test data set

In [60]:

test_data = total_set.iloc[891:, :]
test_data.tail()

Out[60]:

	PassengerId	Pclass	Sex	SibSp	Parch	Embarked	name_code	Age_value	Fare_value
413	1305	3	0	0	0	0	0.0	5	1
414	1306	1	1	0	0	1	4.0	7	3
415	1307	3	0	0	0	0	0.0	7	0
416	1308	3	0	0	0	0	0.0	5	1
417	1309	3	0	1	1	1	3.0	5	2

The PassengerId column of train_data is just an index, so drop it.
The PassengerId column of test_data is needed when submitting results to Kaggle, so leave it as is.

In [61]:

train_data.drop('PassengerId', axis=1, inplace=True )

Check the correlations in the train data. For features whose correlation with Survived comes out around 0.0x (close to zero), I will try transforming the feature.

In [62]:

train_data.corr()

Out[62]:

	Pclass	Sex	SibSp	Parch	Embarked	name_code	Age_value	Fare_value	Survived
Pclass	1.000000	-0.131900	0.083081	0.018443	0.045702	-0.105629	-0.400822	-0.644180	-0.338481
Sex	-0.131900	1.000000	0.114631	0.245489	0.116569	0.618390	-0.122521	0.239812	0.543351
SibSp	0.083081	0.114631	1.000000	0.414838	-0.059961	0.312732	-0.250593	0.378720	-0.035322
Parch	0.018443	0.245489	0.414838	1.000000	-0.078665	0.388093	-0.185674	0.374659	0.081629
Embarked	0.045702	0.116569	-0.059961	-0.078665	1.000000	0.036684	-0.037095	-0.094764	0.106811
name_code	-0.105629	0.618390	0.312732	0.388093	0.036684	1.000000	-0.227037	0.337250	0.466655
Age_value	-0.400822	-0.122521	-0.250593	-0.185674	-0.037095	-0.227037	1.000000	0.123883	-0.073381
Fare_value	-0.644180	0.239812	0.378720	0.374659	-0.094764	0.337250	0.123883	1.000000	0.306855
Survived	-0.338481	0.543351	-0.035322	0.081629	0.106811	0.466655	-0.073381	0.306855	1.000000
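
To pick out the features with a near-zero correlation with Survived, the Survived column of the table can be filtered by absolute value. A rough sketch: the numbers are copied by hand from the Out[62] table above, and the 0.1 cut-off is my own assumption for what counts as "0.0x".

```python
import pandas as pd

# Survived column of the correlation table above (Out[62]), copied by hand.
# With the real frame this would be: train_data.corr()["Survived"].drop("Survived")
corr_with_survived = pd.Series({
    "Pclass": -0.338481,
    "Sex": 0.543351,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Embarked": 0.106811,
    "name_code": 0.466655,
    "Age_value": -0.073381,
    "Fare_value": 0.306855,
})

threshold = 0.1  # hypothetical cut-off for a "0.0x" correlation
weak = corr_with_survived[corr_with_survived.abs() < threshold]

print(weak.index.tolist())  # ['SibSp', 'Parch', 'Age_value']
```

These are the three columns that get reworked in the rest of the post.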

In [63]:

sns.heatmap(train_data.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)  # seaborn's parameter name is linewidths
fig = plt.gcf()
fig.set_size_inches(10,7)
plt.show()

Create the Alone and BigFamily columns

Add SibSp and Parch to create a Family_size column.

Alone : 1 if Family_size is 0, otherwise 0

BigFamily : 1 if SibSp or Parch is greater than 3, otherwise 0

In [64]:

train_data['Family_size'] = 0
train_data.loc[ : ,'Family_size'] = train_data['SibSp'] + train_data['Parch']

In [65]:

train_data.head()

Out[65]:

	Pclass	Sex	SibSp	Embarked	name_code	Age_value	Fare_value	Survived	Family_size
0	3	0	1	0	0.0	4	0	0	1
1	1	1	1	1	2.0	7	3	1	1
2	3	1	0	0	1.0	5	1	1	0
3	1	1	1	0	2.0	6	3	1	1
4	3	0	0	0	0.0	6	1	0	0

In [66]:

train_data['Alone'] = 0
train_data.loc[train_data['Family_size'] ==0 , 'Alone'] =1

In [67]:

train_data['BigFamily'] = 0
train_data.loc[ (train_data['SibSp'] >3 ) | (train_data['Parch'] > 3 ) , 'BigFamily'] = 1

In [68]:

train_data.drop(['SibSp', 'Parch'], axis=1, inplace=True)

In [69]:

train_data.drop(['Family_size'], axis=1, inplace=True)

In [70]:

#titanic_data.drop(['Sex'], axis=1, inplace=True)
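
The Alone and BigFamily flags built in the cells above can also be derived directly from boolean masks, without first creating a column of zeros. A minimal sketch on a hypothetical toy frame (the rows are made up; only the column names come from the post):

```python
import pandas as pd

# Hypothetical toy rows with the same column names as train_data.
df = pd.DataFrame({"SibSp": [1, 0, 4, 0], "Parch": [0, 0, 1, 5]})

family_size = df["SibSp"] + df["Parch"]

# A boolean mask cast to int gives the 0/1 flag in one step.
df["Alone"] = (family_size == 0).astype(int)
df["BigFamily"] = ((df["SibSp"] > 3) | (df["Parch"] > 3)).astype(int)

print(df)
```

This keeps the same `> 3` condition as the original cells; Family_size never needs to be stored as a column because it is only used to build Alone.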

We can see that the survival rate is lower for passengers traveling alone.

In [71]:

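# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the original point plot
sns.catplot(x='Alone', y='Survived', data=train_data, kind='point')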
plt.show()

The more expensive the ticket, the higher the survival rate.

In [72]:

sns.catplot(x='Fare_value', y='Survived', data=train_data, kind='point')
plt.show()
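
The points in the two plots above are the mean of Survived within each category, so the same numbers can be read off with a groupby. A small sketch, assuming made-up toy rows rather than the real train_data:

```python
import pandas as pd

# Hypothetical toy rows; the real call would use train_data instead.
toy = pd.DataFrame({
    "Alone":    [0, 0, 1, 1, 1, 0],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Mean of a 0/1 column per group = survival rate per group.
rate_by_alone = toy.groupby("Alone")["Survived"].mean()

print(rate_by_alone)
```

Swapping "Alone" for "Fare_value" gives the survival rate per fare bin in the same way.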

Age_value had a low correlation with Survived above, so I will try regrouping Age_value.

In [73]:

sns.catplot(x='Age_value', y='Survived', data=train_data, kind='point')
plt.show()

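The plot above shows that the survival rate is high for ages 0 to 5 (bin 0) and low for ages 65 and over (bins 13, 14, 15). Bin 15 contains only one passenger.
The remaining age bins appear to have roughly similar survival rates.
So the bins are regrouped as 0 -> 0, 1~12 -> 1, 13~15 -> 2.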

In [74]:

train_data.Age_value.replace(
    {0:0, 1:1, 2:1, 3:1, 4:1, 5:1,
     6:1, 7:1, 8:1, 9:1, 10:1, 11:1,
     12:1, 13:2, 14:2, 15:2},   inplace = True
)

In [75]:

sns.catplot(x='Age_value', y='Survived', data=train_data, kind='point')
plt.show()
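
The regrouping of Age_value (0 -> 0, 1~12 -> 1, 13~15 -> 2) can also be written with bin edges instead of a 16-entry dictionary. A sketch using `pd.cut`; the helper name `regroup_age_bins` is hypothetical and not part of the original notebook:

```python
import pandas as pd

def regroup_age_bins(age_value: pd.Series) -> pd.Series:
    """Collapse age bins 0..15 into 3 groups: 0 -> 0, 1..12 -> 1, 13..15 -> 2."""
    # Intervals are right-closed: (-1, 0], (0, 12], (12, 15].
    grouped = pd.cut(age_value, bins=[-1, 0, 12, 15], labels=[0, 1, 2])
    return grouped.astype(int)

ages = pd.Series([0, 1, 6, 12, 13, 15])
print(regroup_age_bins(ages).tolist())  # [0, 1, 1, 1, 2, 2]
```

Assigning the result back, as in `train_data['Age_value'] = regroup_age_bins(train_data['Age_value'])`, avoids relying on an in-place replace through a column accessor.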

Look at the correlations again.

In [76]:

sns.heatmap(train_data.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)  # seaborn's parameter name is linewidths
fig = plt.gcf()
fig.set_size_inches(10,7)
plt.show()

In [77]:

train_data.head()

Out[77]:

	Pclass	Sex	Embarked	name_code	Age_value	Fare_value	Survived	Alone
0	3	0	0	0.0	1	0	0	0
1	1	1	1	2.0	1	3	1	0
2	3	1	0	1.0	1	1	1	1
3	1	1	0	2.0	1	3	1	0
4	3	0	0	0.0	1	1	0	1

Finally, save the processed train_data as a pickle.

In [88]:

train_data.to_pickle('final_train')
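
A quick way to confirm that the saved file restores the same frame is to read it back and compare. A minimal sketch with a toy frame and a temporary directory (both are assumptions for illustration; the notebook itself writes to the working directory):

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the processed train_data.
frame = pd.DataFrame({"Pclass": [3, 1], "Survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same extension-less file name as in the cell above.
    path = os.path.join(tmp_dir, "final_train")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(frame))
```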

test_data set

In [78]:

#titanic_data_test = pd.read_pickle("titanic_data_test.pickle")
test_data.head()

Out[78]:

	PassengerId	Pclass	Sex	SibSp	Parch	Embarked	name_code	Age_value	Fare_value
0	892	3	0	0	0	2	0.0	6	0
1	893	3	1	1	0	0	2.0	9	0
2	894	2	0	0	0	2	0.0	12	1
3	895	3	0	0	0	0	0.0	5	1
4	896	3	1	1	1	0	2.0	4	1

Process it exactly the same way as the train data.

In [79]:

test_data['Family_size'] = test_data['SibSp'] + test_data['Parch']

In [80]:

test_data['Alone'] = 0
test_data.loc[test_data['Family_size'] ==0 , 'Alone'] =1

In [81]:

test_data['BigFamily'] = 0
test_data.loc[ (test_data['SibSp'] >3 ) | (test_data['Parch'] > 3 ) , 'BigFamily'] =1

In [82]:

test_data.drop(['SibSp', 'Parch'], axis=1, inplace=True)

In [83]:

test_data.drop(['Family_size'], axis=1, inplace=True)

In [521]:

#titanic_data_test.drop(['Sex'], axis=1, inplace=True)

In [84]:

test_data.Age_value.replace(
    {0:0, 1:1, 2:1, 3:1, 4:1, 5:1,
     6:1, 7:1, 8:1, 9:1, 10:1, 11:1,
     12:1, 13:2, 14:2, 15:2},   inplace = True
)

In [85]:

test_data.head()

Out[85]:

	PassengerId	Pclass	Sex	Embarked	name_code	Age_value	Fare_value	Alone
0	892	3	0	2	0.0	1	0	1
1	893	3	1	0	2.0	1	0	0
2	894	2	0	2	0.0	1	1	1
3	895	3	0	0	0.0	1	1	1
4	896	3	1	0	2.0	1	1	0

In [87]:

test_data.to_pickle('final_test')
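
Because the test set has to go through exactly the same steps as the train set, one option is to put the steps in a single function and call it on both frames. This is only a sketch under my own assumptions: the helper name `engineer_features` and the toy rows are hypothetical, and the steps mirror the cells above (Alone, BigFamily, age regrouping, dropping SibSp and Parch).

```python
import pandas as pd

# Age bin regrouping from the post: 0 -> 0, 1..12 -> 1, 13..15 -> 2.
AGE_GROUPS = {0: 0, **{b: 1 for b in range(1, 13)}, **{b: 2 for b in range(13, 16)}}

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a processed copy of df (hypothetical helper, not from the notebook)."""
    out = df.copy()  # work on a copy so the input frame is left unchanged
    family_size = out["SibSp"] + out["Parch"]
    out["Alone"] = (family_size == 0).astype(int)
    out["BigFamily"] = ((out["SibSp"] > 3) | (out["Parch"] > 3)).astype(int)
    out["Age_value"] = out["Age_value"].map(AGE_GROUPS)
    return out.drop(columns=["SibSp", "Parch"])

# Toy rows (made up); the real calls would pass train_data and test_data.
train_toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 0], "Age_value": [4, 0]})
test_toy = pd.DataFrame({"PassengerId": [892, 893], "SibSp": [0, 4],
                         "Parch": [0, 1], "Age_value": [6, 14]})

train_out = engineer_features(train_toy)
test_out = engineer_features(test_toy)
print(test_out)
```

Any column the function does not touch, such as PassengerId in the test set, passes through unchanged.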
