2. Decision Tree: Branch-Splitting Practice

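This post is based on Professor 최성철's Inflearn (인프런) lecture 밑바닥부터 시작하는 머신러닝 입문 ("Introduction to Machine Learning from Scratch").
Some of the code has been modified.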

(1) Getting started

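Using the 14 records below, we build a decision tree that predicts whether a customer buys a computer.
The attributes are age, income, student, and credit_rating. At each node we compute the Information Gain of every attribute and split on the attribute with the highest value.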

pd_data

To state the conclusion first: the first attribute used to split all 14 records is age.
After that, we find the next split attribute within each branch of age (youth, middle_aged, senior).
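For reference, the quantities used throughout this post can be written as follows. The notation (D for the data set, A for an attribute, D_v for the subset where A takes value v, p_i for the proportion of class i) is the usual textbook one and is an assumption here, since the post itself does not spell out the formulas.

```latex
\mathrm{Info}(D) = -\sum_{i} p_i \log_2 p_i
\qquad
\mathrm{Info}_A(D) = \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\,\mathrm{Info}(D_v)
\qquad
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)
```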

(2) Finding the first attribute

import pandas as pd 
import numpy as np

pd_data_original = pd.read_csv('https://raw.githubusercontent.com/AugustLONG/ML01/master/01decisiontree/AllElectronics.csv')
pd_data = pd_data_original.drop("RID",axis=1) # drop RID because it is not needed
pd_data

buy = pd_data.loc[pd_data['class_buys_computer'] == 'yes']
not_buy = pd_data.loc[pd_data['class_buys_computer'] == 'no']
buy

x = np.array([len(buy) / len(pd_data),len(not_buy) / len(pd_data)])
y = np.log2(x)

result = -sum(x*y)
result

0.94028595867063114

As shown above, the entropy of the whole data set is 0.94028595867063114.
The function below computes the same entropy for any subset of the data.

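def get_info_buy_or_not(df):
    """Return the entropy (base 2) of class_buys_computer within df."""
    # Build the masks from df itself, not from the global pd_data.
    buy = df.loc[df['class_buys_computer'] == 'yes']
    not_buy = df.loc[df['class_buys_computer'] == 'no']

    # Class proportions; zeros are dropped below because log2(0) is -inf.
    x = np.array([len(buy) / len(df), len(not_buy) / len(df)])
    y = np.log2(x[x != 0])

    result = -sum(x[x != 0] * y)

    return result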
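The function above is tied to the two labels 'yes' and 'no'. As a possible generalization, the sketch below computes the same entropy for any number of classes using only the standard library. The name label_entropy is hypothetical, and the 9 'yes' / 5 'no' split is an assumption chosen to be consistent with the entropy 0.9403 reported for the whole data set.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Shannon entropy (base 2) of an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Counter only stores labels that occur, so no zero proportions arise.
    return -sum(c / total * log2(c / total) for c in counts.values())

# Assumed class split: 9 'yes' and 5 'no'.
print(round(label_entropy(["yes"] * 9 + ["no"] * 5), 4))  # 0.9403
```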

print(get_info_buy_or_not(pd_data.loc[pd_data['age'] == 'youth']))
print(get_info_buy_or_not(pd_data.loc[pd_data['age'] == 'senior']))

0.970950594455
0.970950594455

print(get_info_buy_or_not(pd_data.loc[pd_data['age'] == 'middle_aged']))

-0.0
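The -0.0 above comes from negating a floating-point zero; it still compares equal to 0. The x != 0 filter inside get_info_buy_or_not is what prevents a pure subset from producing nan. Here is a minimal sketch of both effects, using a hypothetical pure class distribution [1.0, 0.0]; the variable names are illustrative and do not come from the original notebook.

```python
import numpy as np

# Hypothetical class proportions for a pure subset: every record is 'yes'.
p = np.array([1.0, 0.0])

# Without filtering, log2(0) is -inf and 0 * -inf is nan.
with np.errstate(divide="ignore", invalid="ignore"):
    unfiltered = -np.sum(p * np.log2(p))

# Dropping zero proportions first gives the expected entropy of 0.
nonzero = p[p != 0]
filtered = -np.sum(nonzero * np.log2(nonzero))

print(np.isnan(unfiltered))  # True
print(filtered == 0)         # True: -0.0 compares equal to 0.0
print(abs(filtered))         # 0.0, if a cleaner display is preferred
```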

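A function that computes the entropy after splitting on an attribute.
Despite its name, get_attribute_information_gain returns the size-weighted average entropy of the resulting branches, not the Information Gain itself. The gain is obtained by subtracting this value from the entropy of the data before the split.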

def get_attribute_information_gain(df,attribute):
    
    values = df[attribute].unique()
    
    print(values)
    get_infos = []
    
    for value in values:
        split_data = df.loc[df[attribute]== value]
        #print(get_info_buy_or_not(split_data))
        get_infos.append(len(split_data) /len(df) * get_info_buy_or_not(split_data))
        #print(get_infos)
        
    return sum(get_infos)

get_attribute_information_gain(pd_data, 'age')

['youth' 'middle_aged' 'senior']

0.69353613889619181
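The value 0.6935 can be checked by hand as a size-weighted average of the branch entropies printed earlier. The sketch below assumes branch sizes of 5, 4, and 5 records (as in the standard 14-row AllElectronics table); the sizes are an assumption, since the post shows the table only as notebook output.

```python
# Branch entropies as printed earlier in the post.
branch_entropy = {"youth": 0.970950594455, "middle_aged": 0.0, "senior": 0.970950594455}
# Assumed branch sizes (5 + 4 + 5 = 14 records).
branch_size = {"youth": 5, "middle_aged": 4, "senior": 5}

total = sum(branch_size.values())

# Weighted entropy after splitting on age.
weighted = sum(branch_size[v] / total * branch_entropy[v] for v in branch_size)
print(round(weighted, 4))  # 0.6935

# Information Gain of age = entropy of the whole data set - weighted entropy.
gain = 0.94028595867063114 - weighted
print(round(gain, 4))  # 0.2467
```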

Information Gain for the age attribute

get_info_buy_or_not(pd_data) - get_attribute_information_gain(pd_data, "age")

['youth' 'middle_aged' 'senior']

0.24674981977443933

Information Gain for the income attribute

get_info_buy_or_not(pd_data) - get_attribute_information_gain(pd_data, "income")

['high' 'medium' 'low']

0.029222565658954869

Information Gain for the credit_rating attribute

get_info_buy_or_not(pd_data) - get_attribute_information_gain(pd_data, "credit_rating")

['fair' 'excellent']

0.048127030408269489

Information Gain for the student attribute

get_info_buy_or_not(pd_data) - get_attribute_information_gain(pd_data, "student")

['no' 'yes']

0.15183550136234159

The attribute with the highest Information Gain is age (about 0.247).
It is therefore selected as the first split attribute.
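The comparison can also be done programmatically instead of by eye. A minimal sketch, with the four gain values copied from the outputs above; the names gains and best_attribute are illustrative.

```python
# Information Gain values copied from the notebook outputs above.
gains = {
    "age": 0.24674981977443933,
    "income": 0.029222565658954869,
    "credit_rating": 0.048127030408269489,
    "student": 0.15183550136234159,
}

# The split attribute is the key with the largest gain.
best_attribute = max(gains, key=gains.get)
print(best_attribute)  # age
```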

(3_1) Finding the next split attribute within the youth branch of age

youth = pd_data.loc[pd_data['age']== 'youth']
youth

get_info_buy_or_not(youth) - get_attribute_information_gain(youth, "income")

['high' 'medium' 'low']

0.57095059445466856

get_info_buy_or_not(youth) - get_attribute_information_gain(youth, "student")

['no' 'yes']

0.97095059445466858

get_info_buy_or_not(youth) - get_attribute_information_gain(youth, "credit_rating")

['fair' 'excellent']

0.019973094021974891

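The attribute with the highest Information Gain is student. Its gain equals the full entropy of the youth branch (0.971), so this split separates the classes completely.
Therefore student is selected as the next split attribute within the youth branch of age.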

(3_2) Finding the next split attribute within the senior branch of age

senior = pd_data.loc[pd_data['age']== 'senior']
senior

get_info_buy_or_not(senior) - get_attribute_information_gain(senior, "income")

['medium' 'low']

0.019973094021974891

get_info_buy_or_not(senior) - get_attribute_information_gain(senior, "student")

['no' 'yes']

0.019973094021974891

get_info_buy_or_not(senior) - get_attribute_information_gain(senior, "credit_rating")

['fair' 'excellent']

0.97095059445466858

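The attribute with the highest Information Gain is credit_rating. Its gain equals the full entropy of the senior branch (0.971), so this split separates the classes completely.
Therefore credit_rating is selected as the next split attribute within the senior branch of age.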

(3_3) Finding the next split attribute within the middle_aged branch of age

middle_aged = pd_data.loc[pd_data['age']== 'middle_aged']
middle_aged

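In the data above, every middle_aged record has class_buys_computer equal to yes, so there is no need to create further branches. The calls below confirm this: the Information Gain of every attribute is zero (displayed as -0.0).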

get_info_buy_or_not(middle_aged) - get_attribute_information_gain(middle_aged, "income")

['high' 'low' 'medium']

-0.0

get_info_buy_or_not(middle_aged) - get_attribute_information_gain(middle_aged, "student")

['no' 'yes']

-0.0

get_info_buy_or_not(middle_aged) - get_attribute_information_gain(middle_aged, "credit_rating")

['fair' 'excellent']

-0.0
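Putting the three branches together gives a two-level tree. The sketch below encodes it as a nested dict with a small lookup function. The dict layout, the predict helper, and the leaf labels under student and credit_rating are assumptions: the post establishes which attribute splits each branch, while the leaf labels follow the standard AllElectronics table.

```python
# Internal nodes are {"attribute": ..., "branches": {...}}; leaves are class labels.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",  # pure branch, no further split
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "yes", "excellent": "no"},
        },
    },
}

def predict(node, record):
    """Walk the tree until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

print(predict(tree, {"age": "youth", "student": "yes"}))               # yes
print(predict(tree, {"age": "senior", "credit_rating": "excellent"}))  # no
```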
