Pandas, What Are You? (Part 1)

느린 개미 2018. 6. 10. 16:38

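The Pandas notes below are my own summary of the pandas portion of Professor Sungchul Choi's Inflearn course "밑바닥부터 시작하는 머신러닝 입문" (Introduction to Machine Learning from Scratch), written down so that I remember the material longer.

I have added, removed, and modified some parts.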

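1. What is Pandas?

A Python library for processing structured data
Built on top of NumPy, a high-performance array computation library, it provides powerful "spreadsheet"-style processing
The Excel of the Python world!!

2. Loading data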

In [146]:

import pandas as pd

In [147]:

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data' #Data URL
df_data = pd.read_csv(data_url, sep=r'\s+', header=None) # load the data; fields are separated by whitespace and the file has no header row

In [148]:

df_data.head() # print the first five rows

Out[148]:

	0	1	2	4	5	6	7	8	9	10	11	12	13
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222.0	18.7	396.90	5.33	36.2
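
Because `header=None` leaves the columns labelled 0, 1, 2, ..., it can be convenient to supply labels with the `names` argument of `read_csv`. The following is a minimal sketch under assumptions: it reads from an in-memory string instead of the URL, and the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical whitespace-delimited text standing in for the downloaded file.
raw_text = "0.1  18.0  2.3\n0.2   0.0  7.1\n"

# sep=r"\s+" splits on any run of whitespace; names= supplies the column labels.
sample_df = pd.read_csv(
    io.StringIO(raw_text),
    sep=r"\s+",
    header=None,
    names=["col_a", "col_b", "col_c"],  # illustrative names, not the real ones
)
print(sample_df)
```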

3. Series

A Series is the object that holds the data of a single column of a DataFrame

In [149]:

from pandas import Series, DataFrame
import numpy as np

In [150]:

example_obj = Series()

In [151]:

list_data = [1, 2, 3, 4, 5]
example_obj = Series(data = list_data)
example_obj

Out[151]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

index : sets the index labels
dtype : sets the data type
name : sets the name of the Series

In [152]:

list_data = [1, 2, 3, 4, 5]
list_name = ["a", "b", "c", "d", "e"]
example_obj = Series(data = list_data, index = list_name, dtype = np.float32, name = "example_data") # set the index labels
example_obj

Out[152]:

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: example_data, dtype: float32

A dict can also be passed directly

In [153]:

dict_data = {"a":1, "b":2, "c":3, "d":4, "e":5}
example_obj = Series(dict_data, dtype = np.float32, name = "example_data") # the dict keys become the index labels
example_obj

Out[153]:

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: example_data, dtype: float32

Accessing a value by index label & assigning a value by index label

In [154]:

example_obj["a"]

Out[154]:

1.0

In [155]:

example_obj["a"] = 3.2
example_obj

Out[155]:

a    3.2
b    2.0
c    3.0
d    4.0
e    5.0
Name: example_data, dtype: float32

Getting only the values & only the index

In [156]:

example_obj.values

Out[156]:

array([ 3.20000005,  2.        ,  3.        ,  4.        ,  5.        ], dtype=float32)

In [157]:

example_obj.index

Out[157]:

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Creating a Series based on the given index (labels missing from the data become NaN)

In [158]:

dict_data_1 = {"a":1, "b":2, "c":3, "d":4, "e":5}
indexes = ["a", "b", "c", "d", "e", "f", "g", "h"]
series_obj_1 = Series(dict_data_1, index=indexes)
series_obj_1

Out[158]:

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
f    NaN
g    NaN
h    NaN
dtype: float64

4. Dataframe Overview

Each column can have a different data type.
There is both a row index and a column index.
A DataFrame is a data table made of a collection of Series, so it is basically two-dimensional.

In [159]:

# Example from - https://chrisalbon.com/python/pandas_map_values_to_values.html
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city'])
df

Out[159]:

	first_name	last_name	age	city
0	Jason	Miller	42	San Francisco
1	Molly	Jacobson	52	Baltimore
2	Tina	Ali	36	Miami
3	Jake	Milner	24	Douglas
4	Amy	Cooze	73	Boston

Selecting columns

In [160]:

DataFrame(raw_data, columns = ["age", "city"])

Out[160]:

	age	city
0	42	San Francisco
1	52	Baltimore
2	36	Miami
3	24	Douglas
4	73	Boston

Adding a new column

In [161]:

df =DataFrame(raw_data, 
          columns = ["first_name","last_name","age", "city", "debt"]
         )

In [162]:

df.debt

Out[162]:

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: debt, dtype: object

In [163]:

df["debt"]

Out[163]:

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: debt, dtype: object

loc : selects by index label (location)
iloc : selects by integer position
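
The difference between the two only becomes visible when the index labels are not 0, 1, 2, .... The sketch below uses a small hypothetical Series with labels 10 to 12 to show that `loc` looks up labels while `iloc` counts positions, and that their slices treat the end point differently.

```python
import pandas as pd

# Hypothetical Series whose labels do not match its positions.
ages = pd.Series([42, 52, 36], index=[10, 11, 12])

by_label = ages.loc[10]     # label 10 -> the first element
by_position = ages.iloc[0]  # position 0 -> also the first element

# Slices differ: loc includes the end label, iloc excludes the end position.
label_slice = ages.loc[10:11]    # labels 10 and 11
position_slice = ages.iloc[0:1]  # position 0 only
print(by_label, by_position, len(label_slice), len(position_slice))
```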

In [164]:

df.head()

Out[164]:

	first_name	last_name	age	city	debt
0	Jason	Miller	42	San Francisco	NaN
1	Molly	Jacobson	52	Baltimore	NaN
2	Tina	Ali	36	Miami	NaN
3	Jake	Milner	24	Douglas	NaN
4	Amy	Cooze	73	Boston	NaN

In [165]:

df.loc[1]

Out[165]:

first_name        Molly
last_name      Jacobson
age                  52
city          Baltimore
debt                NaN
Name: 1, dtype: object

In [166]:

df["age"].iloc[1:]

Out[166]:

1    52
2    36
3    24
4    73
Name: age, dtype: int64

Assigning new data to a column

In [167]:

df.debt = df.age > 40
df

Out[167]:

	first_name	last_name	age	city	debt
0	Jason	Miller	42	San Francisco	True
1	Molly	Jacobson	52	Baltimore	True
2	Tina	Ali	36	Miami	False
3	Jake	Milner	24	Douglas	False
4	Amy	Cooze	73	Boston	True

Transpose

In [168]:

df.T

Out[168]:

	0	1	2	3	4
first_name	Jason	Molly	Tina	Jake	Amy
last_name	Miller	Jacobson	Ali	Milner	Cooze
age	42	52	36	24	73
city	San Francisco	Baltimore	Miami	Douglas	Boston
debt	True	True	False	False	True

Printing the values

In [169]:

df.values

Out[169]:

array([['Jason', 'Miller', 42, 'San Francisco', True],
       ['Molly', 'Jacobson', 52, 'Baltimore', True],
       ['Tina', 'Ali', 36, 'Miami', False],
       ['Jake', 'Milner', 24, 'Douglas', False],
       ['Amy', 'Cooze', 73, 'Boston', True]], dtype=object)

Deleting a column

In [170]:

del df["debt"]
df

Out[170]:

	first_name	last_name	age	city
0	Jason	Miller	42	San Francisco
1	Molly	Jacobson	52	Baltimore
2	Tina	Ali	36	Miami
3	Jake	Milner	24	Douglas
4	Amy	Cooze	73	Boston

5. Selection & Drop

Selecting a single column

In [171]:

df["first_name"].head(3)

Out[171]:

0    Jason
1    Molly
2     Tina
Name: first_name, dtype: object

Selecting one or more columns

In [172]:

df [["first_name", "last_name", "age"]].head(3)

Out[172]:

	first_name	last_name	age
0	Jason	Miller	42
1	Molly	Jacobson	52
2	Tina	Ali	36

Resetting the index

In [173]:

import numpy as np
df = pd.read_excel("excel-comp-data.xlsx")
df.head()

Out[173]:

	account	name	street	city	state	postal-code	Jan	Feb	Mar
0	211829	Kerluke, Koepp and Hilpert	34456 Sean Highway	New Jaycob	Texas	28752	10000	62000	35000
1	320563	Walter-Trantow	1311 Alvis Tunnel	Port Khadijah	NorthCarolina	38365	95000	45000	35000
2	648336	Bashirian, Kunde and Price	62184 Schamberger Underpass Apt. 231	New Lilianland	Iowa	76517	91000	120000	35000
3	109996	D'Amore, Gleichner and Bode	155 Fadel Crescent Apt. 144	Hyattburgh	Maine	46021	45000	120000	10000
4	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	162000	120000	35000

In [174]:

df.index = list(range(10,25))
df.head()

Out[174]:

	account	name	street	city	state	postal-code	Jan	Feb	Mar
10	211829	Kerluke, Koepp and Hilpert	34456 Sean Highway	New Jaycob	Texas	28752	10000	62000	35000
11	320563	Walter-Trantow	1311 Alvis Tunnel	Port Khadijah	NorthCarolina	38365	95000	45000	35000
12	648336	Bashirian, Kunde and Price	62184 Schamberger Underpass Apt. 231	New Lilianland	Iowa	76517	91000	120000	35000
13	109996	D'Amore, Gleichner and Bode	155 Fadel Crescent Apt. 144	Hyattburgh	Maine	46021	45000	120000	10000
14	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	162000	120000	35000

Data drop

Dropping a row by index label

In [175]:

df.drop(10).head()

Out[175]:

	account	name	street	city	state	postal-code	Jan	Feb	Mar
11	320563	Walter-Trantow	1311 Alvis Tunnel	Port Khadijah	NorthCarolina	38365	95000	45000	35000
12	648336	Bashirian, Kunde and Price	62184 Schamberger Underpass Apt. 231	New Lilianland	Iowa	76517	91000	120000	35000
13	109996	D'Amore, Gleichner and Bode	155 Fadel Crescent Apt. 144	Hyattburgh	Maine	46021	45000	120000	10000
14	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	162000	120000	35000
15	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	150000	120000	35000

Dropping rows by one or more index labels

In [176]:

df.drop([10,11,12,13,14]).head()

Out[176]:

	account	name	street	city	state	postal-code	Jan	Feb	Mar
15	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	150000	120000	35000
16	145068	Casper LLC	340 Consuela Bridge Apt. 400	Lake Gabriellaton	Mississipi	18008	62000	120000	70000
17	205217	Kovacek-Johnston	91971 Cronin Vista Suite 601	Deronville	RhodeIsland	53461	145000	95000	35000
18	209744	Champlin-Morar	26739 Grant Lock	Lake Juliannton	Pennsylvania	64415	70000	95000	35000
19	212303	Gerhold-Maggio	366 Maggio Grove Apt. 998	North Ras	Idaho	46308	70000	120000	35000

Dropping along a given axis: with axis=1 the "city" column is dropped

In [177]:

df.drop("city", axis=1).head()

Out[177]:

	account	name	street	state	postal-code	Jan	Feb	Mar
10	211829	Kerluke, Koepp and Hilpert	34456 Sean Highway	Texas	28752	10000	62000	35000
11	320563	Walter-Trantow	1311 Alvis Tunnel	NorthCarolina	38365	95000	45000	35000
12	648336	Bashirian, Kunde and Price	62184 Schamberger Underpass Apt. 231	Iowa	76517	91000	120000	35000
13	109996	D'Amore, Gleichner and Bode	155 Fadel Crescent Apt. 144	Maine	46021	45000	120000	10000
14	121213	Bauch-Goldner	7274 Marissa Common	California	49681	162000	120000	35000

6. Dataframe Operations

Series operation

Operations are aligned on the index. A label that exists in only one of the operands produces NaN.

In [178]:

s1 = Series(
    range(1,6), index=list("abced"))
s1

Out[178]:

a    1
b    2
c    3
e    4
d    5
dtype: int32

In [179]:

s2 = Series(
    range(5,11), index=list("bcedef"))
s2

Out[179]:

b     5
c     6
e     7
d     8
e     9
f    10
dtype: int32

In [180]:

s1 + s2

Out[180]:

a     NaN
b     7.0
c     9.0
d    13.0
e    11.0
e    13.0
f     NaN
dtype: float64
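
If NaN is not wanted for labels that exist on only one side, `Series.add` accepts a `fill_value`. The following is a small sketch with hypothetical Series that have unique labels, so the alignment is easy to follow.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=list("abc"))
right = pd.Series([10, 20, 30], index=list("bcd"))

plain_sum = left + right                    # 'a' and 'd' exist on one side only -> NaN
filled_sum = left.add(right, fill_value=0)  # the missing side is treated as 0
print(filled_sum)
```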

Dataframe operation

In [181]:

df1 = DataFrame(
    np.arange(9).reshape(3,3), 
    columns=list("abc"))
df1

Out[181]:

	a	b	c
0	0	1	2
1	3	4	5
2	6	7	8

In [182]:

df2 = DataFrame(
    np.arange(16).reshape(4,4), 
    columns=list("abcd"))
df2

Out[182]:

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15

In [183]:

df1 + df2

Out[183]:

	a	b	c	d
0	0.0	2.0	4.0	NaN
1	7.0	9.0	11.0	NaN
2	14.0	16.0	18.0	NaN
3	NaN	NaN	NaN	NaN

In [184]:

df1.add(df2,fill_value=0)

Out[184]:

	a	b	c	d
0	0.0	2.0	4.0	3.0
1	7.0	9.0	11.0	7.0
2	14.0	16.0	18.0	11.0
3	12.0	13.0	14.0	15.0

7. Lambda and map functions

Lambda

A technique for writing an anonymous function in a single line
lambda argument : expression

map function

map(function, sequence)
Applies function to every element of sequence

A function with two or more arguments needs one sequence per argument

In [185]:

ex = [1,2,3,4,5]
f = lambda x,y : x+y
list(map(f, ex, ex))

Out[185]:

[2, 4, 6, 8, 10]

An anonymous function can also be passed in directly

In [186]:

list(map(lambda x: x+x, ex))

Out[186]:

[2, 4, 6, 8, 10]

map for Series

map can also be used on a Pandas Series

In [187]:

s1 = Series(np.arange(10))
s1.head()

Out[187]:

0    0
1    1
2    2
3    3
4    4
dtype: int32

In [188]:

s1.map(lambda x: x**2).head(5)

Out[188]:

0     0
1     1
2     4
3     9
4    16
dtype: int64

Replacing data with a dict. Values that are not keys of the dict become NaN

In [216]:

z = {1: 'A', 2: 'B', 3: 'C'}
s1.map(z).head(5)

Out[216]:

0    NaN
1      A
2      B
3      C
4    NaN
dtype: object

Mapping with another Series: each value of s1 is replaced by the element of s2 whose index label equals that value

In [190]:

s2 = Series(np.arange(10, 20))
s1.map(s2).head(5)

Out[190]:

0    10
1    11
2    12
3    13
4    14
dtype: int32

map is also often used to convert a categorical (nominal) variable into numbers, after checking its values with unique as below

In [191]:

df = pd.read_csv("wages.csv")
df.head()

Out[191]:

	earn	height	sex	race	ed	age
0	79571.299011	73.89	male	white	16	49
1	96396.988643	66.23	female	white	16	62
2	48710.666947	63.77	female	white	16	33
3	80478.096153	63.22	female	other	16	95
4	82089.345498	63.08	female	white	17	43

In [192]:

df.sex.unique()

Out[192]:

array(['male', 'female'], dtype=object)

In [193]:

df["sex_code"] = df.sex.map({"male":0, "female":1})
df.head(5)

Out[193]:

	earn	height	sex	race	ed	age	sex_code
0	79571.299011	73.89	male	white	16	49	0
1	96396.988643	66.23	female	white	16	62	1
2	48710.666947	63.77	female	white	16	33	1
3	80478.096153	63.22	female	other	16	95	1
4	82089.345498	63.08	female	white	17	43	1

Replace function

Covers only the value-conversion part of what map does
Frequently used when converting data

In [217]:

df.sex.replace(
    {"male":0, "female":1}
).head()

Out[217]:

0    0
1    1
2    1
3    1
4    1
Name: sex, dtype: int64

In [218]:

df["sex"] = df["sex"].replace(
          ["male", "female"], [0, 1])  # assign the result back rather than relying on inplace=True on a selected column
df.head(5)

Out[218]:

	earn	height	sex	race	ed	age
0	79571.299011	73.89	0	0	16	49
1	96396.988643	66.23	1	0	16	62
2	48710.666947	63.77	1	0	16	33
3	80478.096153	63.22	1	1	16	95
4	82089.345498	63.08	1	0	17	43
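
One practical difference between `map` and `replace` with a dict is how values without a mapping are handled. The sketch below uses a hypothetical Series in which one value is deliberately left out of the dict.

```python
import pandas as pd

# Hypothetical data; dtype=object keeps the behavior the same across pandas versions.
colors = pd.Series(["red", "blue", "green"], dtype=object)
codes = {"red": 0, "blue": 1}  # "green" is deliberately missing

mapped = colors.map(codes)        # unmapped "green" becomes NaN
replaced = colors.replace(codes)  # unmapped "green" is kept as it is
print(mapped.tolist(), replaced.tolist())
```

So `map` is the stricter choice when every value is expected to have a code, while `replace` suits partial substitutions.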

apply for dataframe

Unlike map, apply passes an entire Series (column) to the function
The function receives a Series as its input and can work on it as a whole
One result is returned per column

In [196]:

df_info = df[["earn", "height", "age"]]
df_info.head()

Out[196]:

	earn	height	age
0	79571.299011	73.89	49
1	96396.988643	66.23	62
2	48710.666947	63.77	33
3	80478.096153	63.22	95
4	82089.345498	63.08	43

In [197]:

f = lambda x : x.max()-x.min()
df_info.apply(f)

Out[197]:

earn      318047.708444
height        19.870000
age           73.000000
dtype: float64

The built-in aggregation methods give the same result (mean, std, and so on can be used as well)

In [198]:

df_info.sum()

Out[198]:

earn      4.474344e+07
height    9.183125e+04
age       6.250800e+04
dtype: float64

In [199]:

df_info.apply(sum)

Out[199]:

earn      4.474344e+07
height    9.183125e+04
age       6.250800e+04
dtype: float64
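
`apply` works column by column by default, and the `axis` argument switches it to row by row. A minimal sketch with a hypothetical two-column table:

```python
import pandas as pd

scores = pd.DataFrame({"math": [90, 60], "art": [70, 80]})  # hypothetical data

value_range = lambda s: s.max() - s.min()

per_column = scores.apply(value_range)       # axis=0 (default): one result per column
per_row = scores.apply(value_range, axis=1)  # axis=1: one result per row
print(per_column.tolist(), per_row.tolist())
```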

applymap for dataframe

Applies the function element by element rather than Series by Series
The effect is the same as calling apply on a single Series, which also works element by element

In [200]:

f = lambda x: -x
df_info.applymap(f).head()

Out[200]:

	earn	height	age
0	-79571.299011	-73.89	-49
1	-96396.988643	-66.23	-62
2	-48710.666947	-63.77	-33
3	-80478.096153	-63.22	-95
4	-82089.345498	-63.08	-43
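
Note that from pandas 2.1 onward `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, which does the same element-wise job. The sketch below, on hypothetical data, picks whichever of the two the installed version offers.

```python
import pandas as pd

numbers = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical data

negate = lambda x: -x

# Prefer DataFrame.map where it exists; fall back to applymap on older pandas.
if hasattr(pd.DataFrame, "map"):
    negated = numbers.map(negate)
else:
    negated = numbers.applymap(negate)
print(negated)
```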

8. Pandas Built-in functions

describe

Shows summary statistics for the numeric columns

In [201]:

df = pd.read_csv("wages.csv")
df.head()

Out[201]:

	earn	height	sex	race	ed	age
0	79571.299011	73.89	male	white	16	49
1	96396.988643	66.23	female	white	16	62
2	48710.666947	63.77	female	white	16	33
3	80478.096153	63.22	female	other	16	95
4	82089.345498	63.08	female	white	17	43

In [202]:

df.describe()

Out[202]:

	earn	height	ed	age
count	1379.000000	1379.000000	1379.000000	1379.000000
mean	32446.292622	66.592640	13.354605	45.328499
std	31257.070006	3.818108	2.438741	15.789715
min	-98.580489	57.340000	3.000000	22.000000
25%	10538.790721	63.720000	12.000000	33.000000
50%	26877.870178	66.050000	13.000000	42.000000
75%	44506.215336	69.315000	15.000000	55.000000
max	317949.127955	77.210000	18.000000	95.000000

Unique

Returns the unique values of a Series as a NumPy array

In [203]:

df.race.unique()

Out[203]:

array(['white', 'other', 'hispanic', 'black'], dtype=object)

It is also used to derive an integer index (code) for each unique value

In [204]:

np.array(dict(enumerate(df.race.unique())))

Out[204]:

array({0: 'white', 1: 'other', 2: 'hispanic', 3: 'black'}, dtype=object)

In [205]:

value = list(map(int,np.array(list(enumerate(df.race.unique())))[:, 0]))
value

Out[205]:

[0, 1, 2, 3]

In [206]:

key = np.array(list(enumerate(df.race.unique())))[: , 1].tolist()
key

Out[206]:

['white', 'other', 'hispanic', 'black']

Index labelling (converting the race column to the numeric values 0 to 3)

In [207]:

df["race"] = df["race"].replace(to_replace=key, value=value)  # assign the result back rather than relying on inplace=True on a selected column

In [208]:

df.head()

Out[208]:

	earn	height	sex	race	ed	age
0	79571.299011	73.89	male	0	16	49
1	96396.988643	66.23	female	0	16	62
2	48710.666947	63.77	female	0	16	33
3	80478.096153	63.22	female	1	16	95
4	82089.345498	63.08	female	0	17	43
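
The same kind of labelling can be done in one call with `pd.factorize`, which returns the integer codes together with the unique values. This is an alternative to the enumerate-based approach above, shown here on a small hypothetical sample rather than the wages data.

```python
import pandas as pd

race = pd.Series(["white", "other", "white", "black"])  # hypothetical sample

codes, uniques = pd.factorize(race)
# codes   -> one integer per row, numbered in order of first appearance
# uniques -> the distinct values; position i corresponds to code i
print(codes.tolist(), list(uniques))
```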

isnull

Returns a boolean table of the same shape, with True wherever a value is NaN (null)

In [209]:

df.isnull().head()

Out[209]:

	earn	height	sex	race	ed	age
0	False	False	False	False	False	False
1	False	False	False	False	False	False
2	False	False	False	False	False	False
3	False	False	False	False	False	False
4	False	False	False	False	False	False

In [210]:

df.isnull().sum()  # number of null values in each column

Out[210]:

earn      0
height    0
sex       0
race      0
ed        0
age       0
dtype: int64
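
The wages data has no missing values, so every count above is 0. To see what `isnull` does when values really are missing, here is a sketch on a hypothetical table that contains some NaN and None entries.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"age": [42, np.nan, 36],
                       "city": ["Miami", None, None]})  # hypothetical data

null_mask = people.isnull()         # same shape as people, True where missing
nulls_per_column = null_mask.sum()  # each True counts as 1
rows_with_null = people[null_mask.any(axis=1)].index.tolist()
print(nulls_per_column.to_dict(), rows_with_null)
```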

sort_values

Sorts the data by the values of one or more columns
ascending : sort in ascending order (True by default)

In [211]:

df.sort_values(["age", "earn"], ascending=True ).head(10)

Out[211]:

	earn	height	sex	race	ed	age
1038	-56.321979	67.81	male	2	10	22
800	-27.876819	72.29	male	0	12	22
963	-25.655260	68.90	male	0	12	22
1105	988.565070	64.71	female	0	12	22
801	1000.221504	64.09	female	0	12	22
862	1002.023843	66.59	female	0	12	22
933	1007.994941	68.26	female	0	12	22
988	1578.542814	64.53	male	0	12	22
522	1955.168187	69.87	female	3	12	22
765	2581.870402	64.79	female	0	12	22
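
`ascending` also accepts a list, one flag per sort column, which allows mixed orders. A small sketch on hypothetical data:

```python
import pandas as pd

staff = pd.DataFrame({"age": [22, 30, 22], "earn": [100, 50, 300]})  # hypothetical data

# age ascending, then earn descending within each age.
ordered = staff.sort_values(["age", "earn"], ascending=[True, False])
print(ordered)
```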

Correlation & Covariance

In [212]:

df.age.corr(df.earn)

Out[212]:

0.074003491778360547

In [213]:

df.age.cov(df.earn)

Out[213]:

36523.6992104089

In [214]:

df_info.corrwith(df.earn)

Out[214]:

earn      1.000000
height    0.291600
age       0.074003
dtype: float64

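I also looked up the difference between covariance and the correlation coefficient.
Covariance depends on the units of the variables. The correlation coefficient compensates for this
by normalizing away the absolute scale: the covariance is divided by the product of the two variables' standard deviations.
As a result, the correlation coefficient always satisfies -1 <= correlation <= 1.

Source: http://destrudo.tistory.com/15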
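
The relationship described above can be checked numerically. The sketch below uses hypothetical data to confirm that the Pearson correlation equals the covariance divided by the product of the standard deviations, and that rescaling one variable (a unit change, for example) changes the covariance but not the correlation.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])  # hypothetical data

cov_xy = x.cov(y)
corr_xy = x.corr(y)
manual_corr = cov_xy / (x.std() * y.std())  # Pearson r = cov / (std_x * std_y)

# Rescaling y multiplies the covariance by the same factor; the correlation stays put.
y_scaled = y * 1000
cov_scaled = x.cov(y_scaled)
corr_scaled = x.corr(y_scaled)
print(np.isclose(corr_xy, manual_corr), np.isclose(corr_scaled, corr_xy))
```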

I also wondered what the difference between corr and corrwith is, and someone had already kindly asked about it on Stack Overflow.
Thank you!!

corrwith (between two DataFrames)
computes correlation with another dataframe:
between rows or columns of two DataFrame objects

corr (between the columns of a single DataFrame)
computes it with itself
Compute pairwise correlation of columns

Source: https://stackoverflow.com/questions/46041148/pandas-corr-vs-corrwith
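
To make the difference concrete, the following sketch runs both methods on a small hypothetical table. `corr` returns a square matrix of every column against every other column, while `corrwith` returns one value per column against the object passed in (here a single Series, as in the notebook cell above).

```python
import pandas as pd

data = pd.DataFrame({"earn": [10.0, 20.0, 30.0, 45.0],
                     "height": [60.0, 62.0, 65.0, 70.0],
                     "age": [50.0, 40.0, 35.0, 20.0]})  # hypothetical values

pairwise = data.corr()                      # 3 x 3 matrix: column vs column
against_earn = data.corrwith(data["earn"])  # Series: each column vs "earn"
print(pairwise.shape, against_earn.shape)
```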

Pandas_1.ipynb
