
[Data Preprocessing 3] Handling Categorical Data

느린 개미 2018. 10. 17. 02:16

The notes below are based on Professor Choi Sung-chul's lectures in the Inflearn course 밑바닥부터 시작하는 머신러닝 입문 (Introduction to Machine Learning from Scratch), summarized in my own way so that I remember the material longer.

Some parts have been added, removed, or modified.

1. Handling Categorical Data

  • How should we handle data like this: (Green, Blue, Yellow)?

The answer is one-hot encoding.
Given the data set {Green, Blue, Yellow}, create one binary feature per element of the set (a short sketch follows the list below):

  • {Green} -> [1, 0, 0]
  • {Blue} -> [0, 1, 0]
  • {Yellow} -> [0, 0, 1]
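As a minimal sketch of that mapping (not from the lecture; it simply applies pandas' get_dummies to the three colors, which orders the columns alphabetically):

import pandas as pd

colors = pd.Series(["Green", "Blue", "Yellow"])
pd.get_dummies(colors)   # one binary column per distinct value (Blue, Green, Yellow);
                         # each row has exactly one non-zero entry, matching the vectors above up to column order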
In [41]:
import pandas as pd
import numpy as np
In [42]:
edges = pd.DataFrame({'source': [0, 1, 2],
                      'target': [2, 2, 3],
                      'weight': [3, 4, 5],
                      'color': ['red', 'blue', 'blue']})
edges
Out[42]:
  color  source  target  weight
0   red       0       2       3
1  blue       1       2       4
2  blue       2       3       5
In [43]:
edges.dtypes
Out[43]:
color     object
source     int64
target     int64
weight     int64
dtype: object
In [44]:
edges["color"]
Out[44]:
0     red
1    blue
2    blue
Name: color, dtype: object
In [45]:
pd.get_dummies(edges)  # get_dummies encodes only the string (object) columns; numeric columns are left unchanged
Out[45]:
   source  target  weight  color_blue  color_red
0       0       2       3           0          1
1       1       2       4           1          0
2       2       3       5           1          0
In [46]:
pd.get_dummies(edges["color"])
Out[46]:
   blue  red
0     0    1
1     1    0
2     1    0
In [47]:
pd.get_dummies(edges[["color"]])  # with the extra brackets (a DataFrame instead of a Series), the column names get the "color_" prefix
Out[47]:
   color_blue  color_red
0           0          1
1           1          0
2           1          0

Let's convert weight (ordinal data) to a one-hot encoding.

In [48]:
weight_dict = {3: "M", 4: "L", 5: "XL"}  # map each numeric weight to an ordinal size label
edges["weight_sign"] = edges["weight"].map(weight_dict)
edges
Out[48]:
  color  source  target  weight weight_sign
0   red       0       2       3           M
1  blue       1       2       4           L
2  blue       2       3       5          XL
In [49]:
weight_sign = pd.get_dummies(edges["weight_sign"])
weight_sign
Out[49]:
   L  M  XL
0  0  1   0
1  1  0   0
2  0  0   1
In [50]:
pd.concat([edges, weight_sign], axis=1)
Out[50]:
  color  source  target  weight weight_sign  L  M  XL
0   red       0       2       3           M  0  1   0
1  blue       1       2       4           L  1  0   0
2  blue       2       3       5          XL  0  0   1
In [51]:
edges
Out[51]:
  color  source  target  weight weight_sign
0   red       0       2       3           M
1  blue       1       2       4           L
2  blue       2       3       5          XL
In [52]:
pd.get_dummies(edges)
Out[52]:
   source  target  weight  color_blue  color_red  weight_sign_L  weight_sign_M  weight_sign_XL
0       0       2       3           0          1              0              1               0
1       1       2       4           1          0              1              0               0
2       2       3       5           1          0              0              0               1
In [53]:
pd.get_dummies(edges).values
Out[53]:
array([[0, 2, 3, 0, 1, 0, 1, 0],
       [1, 2, 4, 1, 0, 1, 0, 0],
       [2, 3, 5, 1, 0, 0, 0, 1]], dtype=int64)
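Two get_dummies options that this notebook does not use but that are easy to miss (a small sketch reusing the edges frame above; both parameters are part of the pandas API):

pd.get_dummies(edges, columns=["color"])   # encode only the listed columns; other object columns (weight_sign) stay as-is
pd.get_dummies(edges, drop_first=True)     # drop the first dummy of each encoded feature to avoid perfectly collinear columns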

2. Data Binning

  • Use the pd.cut function. An example follows.
In [54]:
raw_data = {
    'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
    'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]
}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
df
Out[54]:
      regiment company      name  preTestScore  postTestScore
0   Nighthawks     1st    Miller             4             25
1   Nighthawks     1st  Jacobson            24             94
2   Nighthawks     2nd       Ali            31             57
3   Nighthawks     2nd    Milner             2             62
4     Dragoons     1st     Cooze             3             70
5     Dragoons     1st     Jacon             4             25
6     Dragoons     2nd    Ryaner            24             94
7     Dragoons     2nd      Sone            31             57
8       Scouts     1st     Sloan             2             62
9       Scouts     1st     Piger             3             70
10      Scouts     2nd     Riani             2             62
11      Scouts     2nd       Ali             3             70
In [55]:
bins = [0, 25, 50, 75, 100]  # bin edges: 0~25, 25~50, 50~75, 75~100
group_name = ['Low', 'Okay', 'Good', 'Great']
categories = pd.cut(df['postTestScore'], bins, labels=group_name)
categories
Out[55]:
0       Low
1     Great
2      Good
3      Good
4      Good
5       Low
6     Great
7      Good
8      Good
9      Good
10     Good
11     Good
Name: postTestScore, dtype: category
Categories (4, object): [Low < Okay < Good < Great]
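Because labels were passed, the result is an ordered Categorical, so the underlying integer codes are also available through the .cat accessor (a tiny sketch, not from the lecture):

categories.cat.codes   # 0=Low, 1=Okay, 2=Good, 3=Great, following the order of group_name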
In [56]:
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_name)
In [57]:
pd.value_counts(df['categories'])
Out[57]:
Good     8
Great    2
Low      2
Okay     0
Name: categories, dtype: int64
In [58]:
df
Out[58]:
      regiment company      name  preTestScore  postTestScore categories
0   Nighthawks     1st    Miller             4             25        Low
1   Nighthawks     1st  Jacobson            24             94      Great
2   Nighthawks     2nd       Ali            31             57       Good
3   Nighthawks     2nd    Milner             2             62       Good
4     Dragoons     1st     Cooze             3             70       Good
5     Dragoons     1st     Jacon             4             25        Low
6     Dragoons     2nd    Ryaner            24             94      Great
7     Dragoons     2nd      Sone            31             57       Good
8       Scouts     1st     Sloan             2             62       Good
9       Scouts     1st     Piger             3             70       Good
10      Scouts     2nd     Riani             2             62       Good
11      Scouts     2nd       Ali             3             70       Good
In [59]:
pd.get_dummies(df)
Out[59]:
    preTestScore  postTestScore  regiment_Dragoons  regiment_Nighthawks  ...  categories_Low  categories_Okay  categories_Good  categories_Great
0              4             25                  0                    1  ...               1                0                0                 0
1             24             94                  0                    1  ...               0                0                0                 1
2             31             57                  0                    1  ...               0                0                1                 0
3              2             62                  0                    1  ...               0                0                1                 0
4              3             70                  1                    0  ...               0                0                1                 0
5              4             25                  1                    0  ...               1                0                0                 0
6             24             94                  1                    0  ...               0                0                0                 1
7             31             57                  1                    0  ...               0                0                1                 0
8              2             62                  0                    0  ...               0                0                1                 0
9              3             70                  0                    0  ...               0                0                1                 0
10             2             62                  0                    0  ...               0                0                1                 0
11             3             70                  0                    0  ...               0                0                1                 0

12 rows × 22 columns
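Before moving on, two variations on the same binning step worth knowing (a small sketch reusing df, bins, and group_name from above; both functions are part of pandas):

pd.cut(df['postTestScore'], bins, labels=False)     # return the integer bin index (0..3) instead of a label
pd.qcut(df['postTestScore'], 4, labels=group_name)  # cut at quantiles instead of fixed edges, so bins hold roughly equal counts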

3. Label encoding by sklearn

In [60]:
raw_example = df.as_matrix()  # as_matrix() was removed in pandas 1.0; in newer versions use df.to_numpy() or df.values
raw_example[:3]
Out[60]:
array([['Nighthawks', '1st', 'Miller', 4, 25, 'Low'],
       ['Nighthawks', '1st', 'Jacobson', 24, 94, 'Great'],
       ['Nighthawks', '2nd', 'Ali', 31, 57, 'Good']], dtype=object)
In [61]:
data = raw_example.copy()
data
Out[61]:
array([['Nighthawks', '1st', 'Miller', 4, 25, 'Low'],
       ['Nighthawks', '1st', 'Jacobson', 24, 94, 'Great'],
       ['Nighthawks', '2nd', 'Ali', 31, 57, 'Good'],
       ['Nighthawks', '2nd', 'Milner', 2, 62, 'Good'],
       ['Dragoons', '1st', 'Cooze', 3, 70, 'Good'],
       ['Dragoons', '1st', 'Jacon', 4, 25, 'Low'],
       ['Dragoons', '2nd', 'Ryaner', 24, 94, 'Great'],
       ['Dragoons', '2nd', 'Sone', 31, 57, 'Good'],
       ['Scouts', '1st', 'Sloan', 2, 62, 'Good'],
       ['Scouts', '1st', 'Piger', 3, 70, 'Good'],
       ['Scouts', '2nd', 'Riani', 2, 62, 'Good'],
       ['Scouts', '2nd', 'Ali', 3, 70, 'Good']], dtype=object)
In [62]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
In [63]:
raw_example[:,0]
Out[63]:
array(['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons',
       'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts',
       'Scouts'], dtype=object)
In [64]:
le.fit(raw_example[:,0])
Out[64]:
LabelEncoder()
In [65]:
le.classes_
Out[65]:
array(['Dragoons', 'Nighthawks', 'Scouts'], dtype=object)
In [66]:
le.transform(raw_example[:,0])
Out[66]:
array([1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], dtype=int64)
In [67]:
data[:, 0] = le.transform(raw_example[:, 0])
data[:3]
Out[67]:
array([[1, '1st', 'Miller', 4, 25, 'Low'],
       [1, '1st', 'Jacobson', 24, 94, 'Great'],
       [1, '2nd', 'Ali', 31, 57, 'Good']], dtype=object)
In [68]:
label_column = [0, 1, 2, 5]   # the string columns: regiment, company, name, categories
label_encoder_list = []
for column_index in label_column:
    le = preprocessing.LabelEncoder()
    le.fit(raw_example[:, column_index])
    data[:, column_index] = le.transform(raw_example[:, column_index])
    label_encoder_list.append(le)   # keep each fitted encoder so the mapping can be reused later
    del le

data[:3]
Out[68]:
array([[1, 0, 4, 4, 25, 2],
       [1, 0, 2, 24, 94, 1],
       [1, 1, 0, 31, 57, 0]], dtype=object)
In [69]:
label_encoder_list[0].transform(raw_example[:10, 0])
Out[69]:
array([1, 1, 1, 1, 0, 0, 0, 0, 2, 2], dtype=int64)
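The fitted encoders kept in label_encoder_list can also reverse the mapping with inverse_transform (a quick sketch, assuming the cells above have already been run):

label_encoder_list[0].inverse_transform(data[:, 0].astype(int))   # back to 'Nighthawks', 'Dragoons', 'Scouts'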

4. One-hot encoding by sklearn

In [70]:
one_hot_enc = preprocessing.OneHotEncoder()   # applied here to the label-encoded regiment column
data[:, 0].reshape(-1, 1)                     # OneHotEncoder expects a 2-D array, hence the reshape
Out[70]:
array([[1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [2],
       [2],
       [2],
       [2]], dtype=object)
In [71]:
one_hot_enc.fit(data[:,0].reshape(-1,1))
Out[71]:
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
In [72]:
one_hot_enc.n_values_
Out[72]:
array([3])
In [73]:
one_hot_enc.active_features_
Out[73]:
array([0, 1, 2], dtype=int64)
In [74]:
onehotlabels = one_hot_enc.transform(data[:,0].reshape(-1,1)).toarray()
In [75]:
onehotlabels
Out[75]:
array([[ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.]])
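Note that categorical_features=, n_values_, and active_features_ come from an older scikit-learn; in current releases they are gone and the fitted categories are exposed as categories_ instead. A minimal sketch with the newer API (assuming scikit-learn >= 1.2, where the dense-output flag is sparse_output):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False)              # return a dense array instead of a sparse matrix
onehotlabels = enc.fit_transform(data[:, 0].reshape(-1, 1))
enc.categories_                                       # the distinct values seen in each encoded column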


