Pandas, What Are You? Part 2

느린 개미 2018. 6. 26. 22:24

The pandas notes below summarize the pandas portion of Professor 최성철's course 밑바닥부터 시작하는 머신러닝 입문 on 인프런 (Inflearn), written up in my own way so that I remember the material longer.

Some parts have been added, removed, or modified.

1. Groupby

In [4]:

# data from: 
import pandas as pd 

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}

df = pd.DataFrame(ipl_data)
df

Out[4]:

	Points	Rank	Team	Year
0	876	1	Riders	2014
1	789	2	Riders	2015
2	863	2	Devils	2014
3	673	3	Devils	2015
4	741	3	Kings	2014
5	812	4	kings	2015
6	756	1	Kings	2016
7	788	1	Kings	2017
8	694	2	Riders	2016
9	701	4	Royals	2014
10	804	1	Royals	2015
11	690	2	Riders	2017

In [6]:

df.groupby("Team")["Points"].sum()

Out[6]:

Team
Devils    1536
Kings     2285
Riders    3049
Royals    1505
kings      812
Name: Points, dtype: int64

("Team") : the column used as the grouping key
["Points"] : the column the operation is applied to
sum() : the operation that is applied
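
To make the three parts concrete, here is a minimal sketch that computes the same per-team sum twice: once with groupby and once by hand with a boolean mask per key. The `toy` frame is a made-up subset for illustration, not the full ipl_data table.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "Team": ["Riders", "Devils", "Riders", "Devils", "Kings"],
    "Points": [876, 863, 789, 673, 741],
})

# groupby: split on "Team", select "Points", apply sum()
by_groupby = toy.groupby("Team")["Points"].sum()

# The same result built by hand: one boolean mask per distinct key
by_hand = {
    team: int(toy.loc[toy["Team"] == team, "Points"].sum())
    for team in toy["Team"].unique()
}

print(by_groupby)
print(by_hand)
```

The hand-written version scans the frame once per key, which is one reason groupby is usually the better choice.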

In [37]:

df.groupby("Team", as_index=False )["Points"].sum() 
# as_index=False keeps the group key as a regular column instead of the index

Out[37]:

	Team	Points
0	Devils	1536
1	Kings	2285
2	Riders	3049
3	Royals	1505
4	kings	812

You can group by more than one column

In [9]:

h_index = df.groupby(["Team", "Year"])["Points"].sum()
h_index

Out[9]:

Team    Year
Devils  2014    863
        2015    673
Kings   2014    741
        2016    756
        2017    788
Riders  2014    876
        2015    789
        2016    694
        2017    690
Royals  2014    701
        2015    804
kings   2015    812
Name: Points, dtype: int64

unstack()

Converts grouped data into a matrix (wide) form

In [10]:

h_index.unstack()

Out[10]:

Year	2014	2015	2016	2017
Team
Devils	863.0	673.0	NaN	NaN
Kings	741.0	NaN	756.0	788.0
Riders	876.0	789.0	694.0	690.0
Royals	701.0	804.0	NaN	NaN
kings	NaN	812.0	NaN	NaN
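
A small sketch of the options around unstack(), using made-up data rather than the table above: which index level moves to the columns, and how to fill the gaps that would otherwise show up as NaN.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Kings"],
    "Year": [2014, 2015, 2014, 2015],
    "Points": [876, 789, 863, 812],
})

s = toy.groupby(["Team", "Year"])["Points"].sum()

# By default the innermost level ("Year") moves to the columns;
# Team/Year pairs that do not exist become NaN
wide = s.unstack()

# fill_value fills those gaps with 0 instead of NaN
wide_filled = s.unstack(fill_value=0)

# level=0 moves "Team" to the columns instead
wide_by_team = s.unstack(level=0)

print(wide)
print(wide_filled)
print(wide_by_team)
```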

Hierarchical index – swaplevel

The order of the index levels can be swapped

In [13]:

h_index.swaplevel()

Out[13]:

Year  Team  
2014  Devils    863
2015  Devils    673
2014  Kings     741
2016  Kings     756
2017  Kings     788
2014  Riders    876
2015  Riders    789
2016  Riders    694
2017  Riders    690
2014  Royals    701
2015  Royals    804
      kings     812
Name: Points, dtype: int64

In [15]:

h_index.swaplevel().sort_index(level=0)

Out[15]:

Year  Team  
2014  Devils    863
      Kings     741
      Riders    876
      Royals    701
2015  Devils    673
      Riders    789
      Royals    804
      kings     812
2016  Kings     756
      Riders    694
2017  Kings     788
      Riders    690
Name: Points, dtype: int64

Basic operations can be performed per index level

In [16]:

h_index.groupby(level=0).sum()  # Series.sum(level=0) was removed in pandas 2.0

Out[16]:

Team
Devils    1536
Kings     2285
Riders    3049
Royals    1505
kings      812
Name: Points, dtype: int64

In [17]:

h_index.groupby(level=1).sum()  # Series.sum(level=1) was removed in pandas 2.0

Out[17]:

Year
2014    3181
2015    3078
2016    1450
2017    1478
Name: Points, dtype: int64
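
Levels can also be addressed by name instead of position, which reads a little more clearly. A minimal sketch on a hand-built Series (the values are a made-up subset):

```python
import pandas as pd

# Hypothetical two-level index (an assumption for illustration only)
idx = pd.MultiIndex.from_tuples(
    [("Devils", 2014), ("Devils", 2015), ("Riders", 2014), ("Riders", 2015)],
    names=["Team", "Year"],
)
s = pd.Series([863, 673, 876, 789], index=idx, name="Points")

# Aggregate over one level, referring to it by name
per_team = s.groupby(level="Team").sum()
per_year = s.groupby(level="Year").sum()

# Any reduction works the same way, not only sum
per_year_mean = s.groupby(level="Year").mean()

print(per_team)
print(per_year)
print(per_year_mean)
```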

2. Extracting the data split by Groupby

In [18]:

grouped = df.groupby("Team")

In [19]:

for name, group in grouped:
    print(name)
    print(group)

Devils
   Points  Rank    Team  Year
2     863     2  Devils  2014
3     673     3  Devils  2015
Kings
   Points  Rank   Team  Year
4     741     3  Kings  2014
6     756     1  Kings  2016
7     788     1  Kings  2017
Riders
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
8      694     2  Riders  2016
11     690     2  Riders  2017
Royals
    Points  Rank    Team  Year
9      701     4  Royals  2014
10     804     1  Royals  2015
kings
   Points  Rank   Team  Year
5     812     4  kings  2015

A single group can be extracted by its key

In [22]:

grouped.get_group("Devils")

Out[22]:

	Points	Rank	Team	Year
2	863	2	Devils	2014
3	673	3	Devils	2015
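
Besides iterating and get_group(), a grouped object can report which rows belong to which key. A rough sketch with made-up data:

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "Team": ["Riders", "Devils", "Riders", "Kings"],
    "Points": [876, 863, 789, 741],
})
grouped = toy.groupby("Team")

# .groups maps each key to the row labels that belong to it
row_labels = {key: list(labels) for key, labels in grouped.groups.items()}

# .size() counts the rows per key
sizes = grouped.size()

# get_group returns the sub-DataFrame for one key
riders = grouped.get_group("Riders")

print(row_labels)
print(sizes)
print(riders)
```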

3. Three types of apply operations are available on groups

Aggregation: extracts summary statistics
Transformation: transforms the data
Filtration: filters out data that does not meet a condition

Aggregation

In [24]:

grouped.agg(sum)

Out[24]:

	Points	Rank	Year
Team
Devils	1536	5	4029
Kings	2285	5	6047
Riders	3049	7	8062
Royals	1505	5	4029
kings	812	4	2015

In [28]:

import numpy as np

grouped.agg(np.mean)

Out[28]:

	Points	Rank	Year
Team
Devils	768.000000	2.500000	2014.500000
Kings	761.666667	1.666667	2015.666667
Riders	762.250000	1.750000	2015.500000
Royals	752.500000	2.500000	2014.500000
kings	812.000000	4.000000	2015.000000

Several functions can also be applied to a single column

In [30]:

grouped["Points"].agg([np.sum, np.mean, np.std])

Out[30]:

	sum	mean	std
Team
Devils	1536	768.000000	134.350288
Kings	2285	761.666667	24.006943
Riders	3049	762.250000	88.567771
Royals	1505	752.500000	72.831998
kings	812	812.000000	NaN

In [41]:

df.groupby(["Team"]).agg({"Points":sum, "Rank":min, "Year":"count"})

Out[41]:

	Points	Rank	Year
Team
Devils	1536	2	2
Kings	2285	1	3
Riders	3049	1	4
Royals	1505	1	2
kings	812	4	1

In [42]:

df.groupby(["Team"]).agg({"Points": [min, max, sum], 
                          "Rank":"count", 
                          "Year":["first", "unique"]})

Out[42]:

	Points			Rank	Year
	min	max	sum	count	first	unique
Team
Devils	673	863	1536	2	2014	[2014, 2015]
Kings	741	788	2285	3	2014	[2014, 2016, 2017]
Riders	690	876	3049	4	2014	[2014, 2015, 2016, 2017]
Royals	701	804	1505	2	2014	[2014, 2015]
kings	812	812	812	1	2015	[2015]
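
The dict-of-lists form above produces two-level column headers. If flat, self-describing column names are preferred, named aggregation (available since pandas 0.25) is one alternative. A minimal sketch; the output column names `total_points`, `best_rank`, and `games` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Rank": [1, 2, 2, 3],
    "Points": [876, 789, 863, 673],
})

# Each keyword is an output column: name=(input column, function)
summary = toy.groupby("Team").agg(
    total_points=("Points", "sum"),
    best_rank=("Rank", "min"),
    games=("Points", "count"),
)

print(summary)
```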

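Transformation

Unlike aggregation, the result is not summarized per key
It supports transforming each individual row
Functions that reduce a Series, such as max or min, are computed per group (by key), and the result is repeated for every row of that group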

In [34]:

grouped.transform(np.mean) # compare the result with agg above

Out[34]:

	Points	Rank	Year
0	762.250000	1.750000	2015.500000
1	762.250000	1.750000	2015.500000
2	768.000000	2.500000	2014.500000
3	768.000000	2.500000	2014.500000
4	761.666667	1.666667	2015.666667
5	812.000000	4.000000	2015.000000
6	761.666667	1.666667	2015.666667
7	761.666667	1.666667	2015.666667
8	762.250000	1.750000	2015.500000
9	752.500000	2.500000	2014.500000
10	752.500000	2.500000	2014.500000
11	762.250000	1.750000	2015.500000
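
Because transform returns one value per original row, the result lines up with the source frame and can be used directly in row-wise arithmetic. A small sketch with made-up numbers that computes each row's difference from its team mean:

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Points": [880, 780, 860, 660],
})

# One value per row: the mean of the group that row belongs to
group_mean = toy.groupby("Team")["Points"].transform("mean")

# Same length and index as toy, so plain subtraction works
toy["diff_from_team_mean"] = toy["Points"] - group_mean

print(toy)
```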

Filter

Used to select data that meets a condition
The function passed to filter must return a boolean
len(x) is the number of rows in each grouped DataFrame

In [36]:

df.groupby("Team").filter(lambda x: len(x)>=3 )

Out[36]:

	Points	Rank	Team	Year
0	876	1	Riders	2014
1	789	2	Riders	2015
4	741	3	Kings	2014
6	756	1	Kings	2016
7	788	1	Kings	2017
8	694	2	Riders	2016
11	690	2	Riders	2017
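
The condition does not have to be a row count; any per-group boolean works. A minimal sketch with made-up data that keeps only teams whose mean Points is at least 800, plus an equivalent formulation using transform and a boolean mask:

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Kings"],
    "Points": [880, 780, 860, 660, 741],
})

# The lambda receives each group's DataFrame and returns True or False
strong = toy.groupby("Team").filter(lambda g: g["Points"].mean() >= 800)

# Equivalent: broadcast the group mean to every row, then mask
mask = toy.groupby("Team")["Points"].transform("mean") >= 800
strong_via_mask = toy[mask]

print(strong)
```

Rows keep their original index and order; whole groups are either kept or dropped.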

4. Pivot Table

The same pivot table we know from Excel
The index axis works the same as in groupby
The columns axis adds a second set of labels
The values are numeric data that gets aggregated

In [48]:

import dateutil.parser  # import the submodule explicitly so dateutil.parser.parse is available

df_phone = pd.read_csv("phone_data.csv")
df_phone['date'] = df_phone['date'].apply(dateutil.parser.parse, dayfirst=True)
df_phone.head()

Out[48]:

	index	date	duration	item	month	network	network_type
0	0	2014-10-15 06:58:00	34.429	data	2014-11	data	data
1	1	2014-10-15 06:58:00	13.000	call	2014-11	Vodafone	mobile
2	2	2014-10-15 14:46:00	23.000	call	2014-11	Meteor	mobile
3	3	2014-10-15 14:48:00	4.000	call	2014-11	Tesco	mobile
4	4	2014-10-15 17:27:00	4.000	call	2014-11	Tesco	mobile

In [49]:

df_phone.pivot_table(["duration"],
                     index = [df_phone.month, df_phone.item],
                    columns = df_phone.network, aggfunc="sum", fill_value=0)

Out[49]:

		duration
	network	Meteor	Tesco	Three	Vodafone	data	landline	special	voicemail	world
month	item
2014-11	call	1521	4045	12458	4316	0.000	2906	0	301	0
	data	0	0	0	0	998.441	0	0	0	0
	sms	10	3	25	55	0.000	0	1	0	0
2014-12	call	2010	1819	6316	1302	0.000	1424	0	690	0
	data	0	0	0	0	1032.870	0	0	0	0
	sms	12	1	13	18	0.000	0	0	0	4
2015-01	call	2207	2904	6445	3626	0.000	1603	0	285	0
	data	0	0	0	0	1067.299	0	0	0	0
	sms	10	3	33	40	0.000	0	0	0	0
2015-02	call	1188	4087	6279	1864	0.000	730	0	268	0
	data	0	0	0	0	1067.299	0	0	0	0
	sms	1	2	11	23	0.000	0	2	0	0
2015-03	call	274	973	4966	3513	0.000	11770	0	231	0
	data	0	0	0	0	998.441	0	0	0	0
	sms	0	4	5	13	0.000	0	0	0	3
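
Since phone_data.csv is not included here, the following is a self-contained sketch on a few made-up rows with the same column names. It also shows that this kind of pivot table gives the same numbers as a groupby over all three keys followed by unstack.

```python
import pandas as pd

# Hypothetical rows mimicking the phone data (an assumption for illustration only)
calls = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-11", "2014-12", "2014-12"],
    "item": ["call", "call", "sms", "call", "sms"],
    "network": ["Tesco", "Three", "Tesco", "Tesco", "Three"],
    "duration": [4.0, 23.0, 1.0, 10.0, 1.0],
})

# index -> row labels, columns -> column labels, values -> what gets aggregated
table = calls.pivot_table(
    values="duration",
    index=["month", "item"],
    columns="network",
    aggfunc="sum",
    fill_value=0,
)

# The same numbers via groupby + unstack
same = (
    calls.groupby(["month", "item", "network"])["duration"]
    .sum()
    .unstack(fill_value=0)
)

print(table)
```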

5. Crosstab

Used to compute cross frequencies, ratios, sums, and so on between two columns
A special case of the pivot table
Can be used to build things like a user-item rating matrix

In [51]:

df_movie = pd.read_csv("movie_rating.csv")
df_movie.head()

Out[51]:

	critic	title	rating
0	Jack Matthews	Lady in the Water	3.0
1	Jack Matthews	Snakes on a Plane	4.0
2	Jack Matthews	You Me and Dupree	3.5
3	Jack Matthews	Superman Returns	5.0
4	Jack Matthews	The Night Listener	3.0

In [55]:

pd.crosstab(index = df_movie.critic,
           columns = df_movie.title,
           values = df_movie.rating, aggfunc="first").fillna(0)

Out[55]:

title	Just My Luck	Lady in the Water	Snakes on a Plane	Superman Returns	The Night Listener	You Me and Dupree
critic
Claudia Puig	3.0	0.0	3.5	4.0	4.5	2.5
Gene Seymour	1.5	3.0	3.5	5.0	3.0	3.5
Jack Matthews	0.0	3.0	4.0	5.0	3.0	3.5
Lisa Rose	3.0	2.5	3.5	3.5	3.0	2.5
Mick LaSalle	2.0	3.0	4.0	3.0	3.0	2.0
Toby	0.0	0.0	4.5	4.0	0.0	1.0

In [57]:

df_movie.pivot_table(["rating"],
                    index = df_movie.critic,
                    columns = df_movie.title, 
                    aggfunc = "sum", fill_value = 0)

Out[57]:

	rating
title	Just My Luck	Lady in the Water	Snakes on a Plane	Superman Returns	The Night Listener	You Me and Dupree
critic
Claudia Puig	3.0	0.0	3.5	4.0	4.5	2.5
Gene Seymour	1.5	3.0	3.5	5.0	3.0	3.5
Jack Matthews	0.0	3.0	4.0	5.0	3.0	3.5
Lisa Rose	3.0	2.5	3.5	3.5	3.0	2.5
Mick LaSalle	2.0	3.0	4.0	3.0	3.0	2.0
Toby	0.0	0.0	4.5	4.0	0.0	1.0
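
The examples above pass values and aggfunc; without them, crosstab simply counts co-occurrences, and normalize turns the counts into ratios. A rough sketch on a few made-up rows:

```python
import pandas as pd

# Hypothetical rows in the style of the movie data (an assumption for illustration only)
ratings = pd.DataFrame({
    "critic": ["Toby", "Toby", "Lisa Rose", "Lisa Rose", "Lisa Rose"],
    "title": ["Snakes on a Plane", "Superman Returns",
              "Snakes on a Plane", "Superman Returns", "Just My Luck"],
    "rating": [4.5, 4.0, 3.5, 3.5, 3.0],
})

# Frequency table: how many ratings each critic gave each title
counts = pd.crosstab(ratings["critic"], ratings["title"])

# normalize="index" divides each row by its row total
share = pd.crosstab(ratings["critic"], ratings["title"], normalize="index")

print(counts)
print(share)
```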

6. Merge

The same operation as the join (merge) widely used in SQL
Combines two datasets into one

Merge on subject_id

In [ ]:

pd.merge(df_a, df_b, on='subject_id')

When the key column has a different name in each DataFrame

In [ ]:

pd.merge(df_a, df_b, left_on='subject_id', right_on='subject_id2')

left join

In [ ]:

pd.merge(df_a, df_b, on='subject_id', how='left')

right join

In [ ]:

pd.merge(df_a, df_b, on='subject_id', how='right')

outer join

In [ ]:

pd.merge(df_a, df_b, on='subject_id', how='outer')

inner join

In [ ]:

pd.merge(df_a, df_b, on='subject_id', how='inner')

index based join

In [ ]:

pd.merge(df_a, df_b, right_index = True, left_index = True)
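
The cells above assume df_a and df_b already exist. Here is a minimal runnable sketch with two made-up frames (the column names `first_name` and `test_score` are assumptions) that shows how the row count changes with `how`:

```python
import pandas as pd

# Hypothetical frames (an assumption for illustration only)
df_a = pd.DataFrame({
    "subject_id": ["1", "2", "3"],
    "first_name": ["Alex", "Amy", "Allen"],
})
df_b = pd.DataFrame({
    "subject_id": ["2", "3", "4"],
    "test_score": [51, 15, 61],
})

# inner: only keys present in both frames
inner = pd.merge(df_a, df_b, on="subject_id", how="inner")

# left: every row of df_a; missing matches become NaN
left = pd.merge(df_a, df_b, on="subject_id", how="left")

# outer: union of keys; indicator=True adds a _merge column with the row's origin
outer = pd.merge(df_a, df_b, on="subject_id", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```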

7. Concat

An operation that attaches data of the same shape together

Appending rows below

In [ ]:

df_new = pd.concat([df_a, df_b])
df_new.reset_index()

Or

In [ ]:

df_a.append(df_b)  # pandas < 2.0 only; DataFrame.append was removed in pandas 2.0

Appending columns side by side

In [ ]:

df_new = pd.concat([df_a, df_b], axis=1)
df_new.reset_index()
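
A self-contained sketch of both directions with two made-up frames. When stacking rows, the original indexes are kept by default, so labels repeat; ignore_index=True is one way to renumber them without adding the extra "index" column that reset_index() creates.

```python
import pandas as pd

# Hypothetical frames (an assumption for illustration only)
df_a = pd.DataFrame({"subject_id": ["1", "2"], "score": [51, 15]})
df_b = pd.DataFrame({"subject_id": ["3", "4"], "score": [61, 16]})

# Rows stacked; the index labels 0, 1 appear twice
stacked = pd.concat([df_a, df_b])

# Rows stacked and renumbered 0..3
renumbered = pd.concat([df_a, df_b], ignore_index=True)

# Columns placed side by side, aligned on the index
side_by_side = pd.concat([df_a, df_b], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```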

8. DB Persistence

pandas can load data through a database connection

In [62]:

import sqlite3

conn = sqlite3.connect("./data/flights.db")
cur = conn.cursor()
cur.execute("select * from airlines limit 5;")
results = cur.fetchall()

Create a DataFrame using the db connection conn

In [68]:

df_airlines = pd.read_sql_query("select * from airlines;", conn)
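
Without flights.db at hand, the round trip can be tried against an in-memory SQLite database. This is a minimal sketch; the three rows are made up in the style of the airlines table, and the query parameter is passed separately rather than formatted into the SQL string.

```python
import sqlite3

import pandas as pd

# In-memory database, so no file is needed (an assumption for illustration only)
conn = sqlite3.connect(":memory:")

airlines = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Private flight", "135 Airways", "1Time Airline"],
    "active": ["Y", "N", "Y"],
})

# Write the DataFrame to a table named "airlines"
airlines.to_sql("airlines", conn, index=False)

# Read it back with a parameterized query ("?" is SQLite's placeholder)
active = pd.read_sql_query(
    "select * from airlines where active = ?;", conn, params=("Y",)
)
conn.close()

print(active)
```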

9. XLS Persistence

Code for exporting a DataFrame to Excel
Uses openpyxl or XlsxWriter as the Excel engine

In [71]:

writer = pd.ExcelWriter("./data/df_airlines.xlsx", engine = 'xlsxwriter')
df_airlines.to_excel(writer, sheet_name='Sheet1')
writer.close()  # the file is only written to disk once the writer is closed

10. Pickle Persistence

The most common way to persist Python objects to a file
Uses the to_pickle and read_pickle functions

In [72]:

df_airlines.to_pickle("./data/df_airlines.pickle")

In [73]:

df_airlines_pickle = pd.read_pickle("./data/df_airlines.pickle")
df_airlines_pickle.head()

Out[73]:

	index	id	name	alias	iata	icao	callsign	country	active
0	0	1	Private flight	\N	-	None	None	None	Y
1	1	2	135 Airways	\N	None	GNL	GENERAL	United States	N
2	2	3	1Time Airline	\N	1T	RNX	NEXTIME	South Africa	Y
3	3	4	2 Sqn No 1 Elementary Flying Training School	\N	None	WYT	None	United Kingdom	N
4	4	5	213 Flight Unit	\N	None	TFU	None	Russia	N
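
A quick way to check the round trip without touching ./data is to write to a temporary directory. A minimal sketch with a made-up two-row frame; note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame (an assumption for illustration only)
frame = pd.DataFrame({"id": [1, 2], "name": ["Private flight", "135 Airways"]})

# Write and read back inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pickle")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, dtypes, and index
same = restored.equals(frame)
print(same)
```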
