Let's understand NumPy

느린 개미 2018. 6. 1. 18:53

[Addendum] While googling I found a site that summarizes NumPy very well, so I am linking it here. Thank you for such a high-quality post.

http://taewan.kim/post/numpy_cheat_sheet/

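The NumPy notes below were written while taking the numpy part of Professor Sungchul Choi (최성철)'s Inflearn course 밑바닥부터 시작하는 머신러닝 입문 (Introduction to Machine Learning from Scratch).

Some parts have been added, removed, or modified.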

1. What is NumPy?

A high-performance scientific computing package for Python
The de facto standard for array operations on matrices and vectors

2. Features

Faster and more memory-efficient than a regular list
Supports processing whole data arrays without loops

3. Creation

The usual way to import numpy. There is no special reason for the np alias; it is simply a worldwide convention.

In [534]:

import numpy as np

The np.array function creates an array -> ndarray
Only one data type can be stored in an array
The biggest difference from a list is that dynamic typing is not supported.

In [535]:

test_array = np.array([1,4,5,8], float)
print(test_array)
print(type(test_array))
print(type(test_array[3]))

[ 1.  4.  5.  8.]
<class 'numpy.ndarray'>
<class 'numpy.float64'>
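As a small illustration of the single-type rule, the sketch below (the variable names are my own, not from the course) shows NumPy converting mixed inputs to one common dtype, whereas a list keeps each element's own type.

```python
import numpy as np

# A Python list keeps each element's own type.
mixed_list = [1, 2.5, 3]
list_types = [type(v).__name__ for v in mixed_list]
print(list_types)

# np.array converts everything to one common dtype (float64 here).
mixed_array = np.array(mixed_list)
print(mixed_array.dtype)

# Writing a float into an integer array truncates it to fit the dtype.
int_array = np.array([1, 4, 5, 8], int)
int_array[0] = 7.9
print(int_array[0])
```

Because the dtype is fixed at creation time, a value that does not fit is converted rather than changing the array's type.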

dtype : returns the data type of the numpy array
shape : returns the dimensions of the numpy array object as a tuple
ndim : returns the number of dimensions
size : the number of elements
nbytes : returns the memory size in bytes

In [536]:

print(test_array.dtype)
print(test_array.shape)
print(test_array.ndim)
print(test_array.size)
print(test_array.nbytes)

float64
(4,)
1
4
32

In [537]:

matrix  = [ [1,2,5,8], [1,2,5,8], [1,2,5,8]]
np.array(matrix, int).dtype

Out[537]:

dtype('int32')

The data type determines how much memory each element occupies.
The data types are compatible with C data types.
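To make the link between dtype and memory concrete, here is a minimal sketch using the itemsize attribute (bytes per element). Note that the default integer type depends on the platform and NumPy version, which is presumably why the output above shows int32.

```python
import numpy as np

values = [1, 2, 5, 8]

for dtype in (np.int8, np.int32, np.float64):
    arr = np.array(values, dtype=dtype)
    # nbytes is always size * itemsize
    print(arr.dtype, arr.itemsize, arr.nbytes)

small = np.array(values, dtype=np.int8)
large = np.array(values, dtype=np.float64)
```

The same four values take 4 bytes as int8 and 32 bytes as float64.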

4. Handling shape

reshape : changes the shape of an array (the number of elements stays the same)
As long as the size matches, the array can be reshaped freely into any number of dimensions

In [538]:

test_matrix = [ [1,2,3,4], [1,2,5,8]]
np.array(test_matrix).shape

Out[538]:

(2, 4)

In [539]:

np.array(test_matrix).reshape(8,)

Out[539]:

array([1, 2, 3, 4, 1, 2, 5, 8])

In [540]:

np.array(test_matrix).reshape(8,).shape

Out[540]:

(8,)

-1 : that dimension is inferred from the size (in the example below, the number of rows is determined from the 2 columns)

In [541]:

np.array(test_matrix).reshape(-1,2).shape

Out[541]:

(4, 2)
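The -1 placeholder can be used for any single dimension, not only the rows. A minimal sketch, with variable names of my own choosing:

```python
import numpy as np

data = np.arange(8)  # 8 elements

rows_inferred = data.reshape(-1, 2)  # 8 / 2 -> 4 rows
cols_inferred = data.reshape(2, -1)  # 8 / 2 -> 4 columns
print(rows_inferred.shape, cols_inferred.shape)

# A shape that does not divide the size evenly raises ValueError.
try:
    data.reshape(-1, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)
```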

flatten : converts a multi-dimensional array into a 1-dimensional array

In [542]:

test_matrix = [[ [1,2,3,4], [1,2,5,8]], [ [1,2,3,4], [1,2,5,8]]]
np.array(test_matrix).flatten()

Out[542]:

array([1, 2, 3, 4, 1, 2, 5, 8, 1, 2, 3, 4, 1, 2, 5, 8])

5. Indexing & slicing

Unlike a list, a two-dimensional array supports the [0,0] notation
For a matrix, the first index is the row and the second is the column

In [543]:

test_matrix = np.array([ [1,2,3,4], [1,2,5,8]], dtype =int)
test_matrix

Out[543]:

array([[1, 2, 3, 4],
       [1, 2, 5, 8]])

In [544]:

test_matrix[0][0]

Out[544]:

1

In [545]:

test_matrix[0,0]

Out[545]:

1

In [546]:

test_matrix[0,0] = 12 # assign 12 to element [0, 0]
test_matrix

Out[546]:

array([[12,  2,  3,  4],
       [ 1,  2,  5,  8]])

Unlike a list, rows and columns can be sliced separately

In [547]:

a = np.array([[1,2,3,4,5], [6,7,8,9,10]], int)
a

Out[547]:

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

In [548]:

print(a[:,2:])  # all rows, columns 2 onward
print("\n")
print(a[1:,1:3]) # row 1 onward, columns 1-2
print("\n")
print(a[1:3]) # rows 1-2, all columns (only row 1 exists here)

[[ 3  4  5]
 [ 8  9 10]]


[[7 8]]


[[ 6  7  8  9 10]]
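For comparison, the sketch below does the same column slice on a nested list and on an ndarray. The nested-list version is my own illustration, not part of the course material.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], int)

# Nested list: each row has to be sliced in a loop.
nested = a.tolist()
list_result = [row[2:] for row in nested]

# ndarray: one expression slices both axes at once.
array_result = a[:, 2:]

# Out-of-range slice bounds are clipped, so a[1:3] returns only row 1.
clipped = a[1:3]
print(array_result.tolist() == list_result, clipped.shape)
```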

6. Create Function

arange : creates an array of values over a specified range

In [549]:

np.arange(30)

Out[549]:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [550]:

np.arange(0,5,0.5) # start, stop, step

Out[550]:

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5])

zeros : creates an ndarray filled with 0

In [551]:

np.zeros(shape=(3,2), dtype=np.int8)

Out[551]:

array([[0, 0],
       [0, 0],
       [0, 0]], dtype=int8)

ones : creates an ndarray filled with 1

In [552]:

np.ones(shape=(3,2), dtype=np.int8)

Out[552]:

array([[1, 1],
       [1, 1],
       [1, 1]], dtype=int8)

empty : creates an ndarray with only the shape specified (the memory is not initialized, so the contents are arbitrary)

In [553]:

np.empty(shape=(3,2), dtype=np.int8)

Out[553]:

array([[0, 0],
       [0, 0],
       [0, 0]], dtype=int8)
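The zeros shown above are incidental: np.empty returns whatever bytes happen to be in the allocated memory. A minimal sketch of the intended usage, which is to allocate first and write every element before reading:

```python
import numpy as np

# Only the shape and dtype are meaningful until values are written.
buffer = np.empty(shape=(3, 2), dtype=np.int8)

# Fill every element before reading from the array.
buffer[:] = 7
print(buffer)
```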

something_like (ones_like, zeros_like, empty_like) : returns an array of ones, zeros, or an empty array with the same shape as an existing ndarray

In [554]:

test_matrix = np.arange(30).reshape(5,6)
np.ones_like(test_matrix)

Out[554]:

array([[1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1]])

In [555]:

np.empty_like(test_matrix)

Out[555]:

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])

identity : creates an identity matrix

In [556]:

np.identity(n=3, dtype=np.int8)

Out[556]:

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=int8)

eye : a matrix with ones on the diagonal; the starting index of the diagonal can be changed with k

In [557]:

np.eye(N=3, M=5, dtype=np.int8)

Out[557]:

array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0]], dtype=int8)

In [558]:

np.eye(3)

Out[558]:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [559]:

np.eye(3,5,k=2) # k => start index

Out[559]:

array([[ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])
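To see what k does, the sketch below lists the (row, column) positions of the ones for a positive and a negative k. The helper variable names are my own.

```python
import numpy as np

upper = np.eye(3, 5, k=2, dtype=int)   # diagonal starts at column 2
lower = np.eye(3, 5, k=-1, dtype=int)  # diagonal starts at row 1

# Positions of the ones, as (row, column) pairs of plain ints.
upper_ones = [(int(r), int(c)) for r, c in zip(*np.nonzero(upper))]
lower_ones = [(int(r), int(c)) for r, c in zip(*np.nonzero(lower))]
print(upper_ones)
print(lower_ones)
```

A positive k shifts the diagonal to the right, and a negative k shifts it downward.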

diag : extracts the diagonal values of a matrix

In [560]:

matrix = np.arange(9).reshape(3, 3)
matrix

Out[560]:

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [561]:

np.diag(matrix)

Out[561]:

array([0, 4, 8])

In [562]:

np.diag(matrix, k=1) # k => start index

Out[562]:

array([1, 5])

random sampling : creates an array by sampling from a data distribution

In [563]:

np.random.uniform(0,1,10).reshape(2,5) # uniform distribution

Out[563]:

array([[ 0.18193674,  0.22632411,  0.11430298,  0.18663024,  0.2140152 ],
       [ 0.151614  ,  0.40406305,  0.74045557,  0.82560664,  0.4993982 ]])

In [564]:

np.random.normal(0,1,10).reshape(2,5) # normal distribution

Out[564]:

array([[ 0.17348277, -0.38541203,  0.64398543,  0.20520195,  1.20996678],
       [ 0.64374891, -0.65881286,  0.01750222, -0.9259097 , -0.01695853]])

7. operation functions

axis : the dimension along which an operation function is applied
To understand the concept of axis, I referred to the lecture video by Professor 김성훈 (Sung Kim) below
모두를 위한 딥러닝 강좌 (Deep Learning for Everyone) ML lab 08: Tensor Manipulation

In [565]:

test_array = np.arange(1,13).reshape(3,4)
test_array

Out[565]:

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [566]:

test_array.sum(axis=1)

Out[566]:

array([10, 26, 42])

In [567]:

test_array.sum(axis=0)

Out[567]:

array([15, 18, 21, 24])
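One way to remember axis is that the axis being summed is the one that disappears from the shape. A minimal sketch, where the 3-D array is my own addition:

```python
import numpy as np

test_array = np.arange(1, 13).reshape(3, 4)

# axis=0 collapses the rows, leaving one value per column.
col_sums = test_array.sum(axis=0)
# axis=1 collapses the columns, leaving one value per row.
row_sums = test_array.sum(axis=1)
print(col_sums.shape, row_sums.shape)

# In a 3-D array the summed axis is removed from the shape.
cube = np.arange(24).reshape(2, 3, 4)
shapes = [cube.sum(axis=i).shape for i in range(3)]
print(shapes)
```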

vstack : stacks arrays vertically.
hstack : stacks arrays horizontally.

In [568]:

a = np.array([1,2,3])
b = np.array([2,3,4])
np.vstack((a,b))

Out[568]:

array([[1, 2, 3],
       [2, 3, 4]])

In [569]:

np.hstack((a,b))

Out[569]:

array([1, 2, 3, 2, 3, 4])

concatenate : joins numpy arrays along a given axis

In [570]:

a = np.array([[1,2,3]])
b = np.array([[2,3,4]])
np.concatenate( (a,b), axis = 0)

Out[570]:

array([[1, 2, 3],
       [2, 3, 4]])

In [571]:

a = np.array( [ [1,2], [3,4]])
b = np.array( [ [5,6] ] )
np.concatenate( (a,b.T), axis=1)

Out[571]:

array([[1, 2, 5],
       [3, 4, 6]])
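For 2-D inputs, concatenate with axis=0 and axis=1 gives the same result as vstack and hstack. The sketch below checks this and shows why b has to be transposed before it can be attached as a column.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# Stacking rows: shapes (2, 2) and (1, 2) agree on the column count.
rows = np.concatenate((a, b), axis=0)
same_as_vstack = np.array_equal(rows, np.vstack((a, b)))

# Stacking columns: b must first become shape (2, 1), hence b.T.
cols = np.concatenate((a, b.T), axis=1)
same_as_hstack = np.array_equal(cols, np.hstack((a, b.T)))
print(rows.shape, cols.shape, same_as_vstack, same_as_hstack)
```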

8. array operations

Supports the basic arithmetic operations, applied element by element

In [572]:

test_a = np.array( [[1,2,3], [4,5,6]], float)
test_a

Out[572]:

array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

In [573]:

test_a + test_a

Out[573]:

array([[  2.,   4.,   6.],
       [  8.,  10.,  12.]])

In [574]:

test_a - test_a

Out[574]:

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [575]:

test_a * test_a

Out[575]:

array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.]])

In [576]:

test_a / test_a

Out[576]:

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

dot product : the matrix multiplication we are familiar with

In [577]:

test_a = np.arange(1,7).reshape(2,3)
test_b = np.arange(7,13).reshape(3,2)

In [578]:

test_a.dot(test_b)

Out[578]:

array([[ 58,  64],
       [139, 154]])
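As a check on the result above, this sketch computes one entry by hand and compares dot with the @ operator, which performs the same matrix product on 2-D arrays.

```python
import numpy as np

test_a = np.arange(1, 7).reshape(2, 3)
test_b = np.arange(7, 13).reshape(3, 2)

product = test_a.dot(test_b)  # (2, 3) x (3, 2) -> (2, 2)
product_op = test_a @ test_b  # same matrix product, written with @

# Entry [0, 0] is the dot product of row 0 of test_a and column 0 of test_b.
manual = sum(int(x) * int(y) for x, y in zip(test_a[0], test_b[:, 0]))
print(product)
print(manual)
```

Note that `*` stays element-wise; only dot and `@` perform matrix multiplication.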

transpose : swaps the rows and columns of a matrix

In [579]:

test_a = np.arange(1,7).reshape(2,3)
test_a

Out[579]:

array([[1, 2, 3],
       [4, 5, 6]])

In [580]:

test_a.T

Out[580]:

array([[1, 4],
       [2, 5],
       [3, 6]])

broadcasting : a feature that supports operations between arrays of different shapes

In [581]:

test_matrix = np.array([[1,2,3], [4,5,6]], float)
scalar = 3

In [582]:

test_matrix + scalar

Out[582]:

array([[ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])

In [583]:

test_matrix - scalar

Out[583]:

array([[-2., -1.,  0.],
       [ 1.,  2.,  3.]])

In [584]:

test_matrix *  scalar

Out[584]:

array([[  3.,   6.,   9.],
       [ 12.,  15.,  18.]])

In [585]:

test_matrix / scalar

Out[585]:

array([[ 0.33333333,  0.66666667,  1.        ],
       [ 1.33333333,  1.66666667,  2.        ]])

In [586]:

test_matrix ** scalar # raise each element to the power of 3

Out[586]:

array([[   1.,    8.,   27.],
       [  64.,  125.,  216.]])

In [587]:

test_matrix // scalar # quotient of matrix divided by scalar (floor division)

Out[587]:

array([[ 0.,  0.,  1.],
       [ 1.,  1.,  2.]])

Besides scalar-vector operations, broadcasting also supports operations between a vector and a matrix

In [588]:

test_matrix = np.arange(1,13).reshape(4,3)
test_matrix

Out[588]:

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [589]:

test_vector = np.arange(10,40,10)
test_vector

Out[589]:

array([10, 20, 30])

In [590]:

test_matrix + test_vector

Out[590]:

array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39],
       [20, 31, 42]])
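The vector above has shape (3,), so it is applied to every row. The sketch below adds a column-vector case and a shape mismatch; both are my own examples rather than part of the course.

```python
import numpy as np

test_matrix = np.arange(1, 13).reshape(4, 3)

row_vector = np.array([10, 20, 30])                  # shape (3,): added to every row
col_vector = np.array([[100], [200], [300], [400]])  # shape (4, 1): added to every column

by_row = test_matrix + row_vector
by_col = test_matrix + col_vector
print(by_col)

# Shapes that cannot be aligned raise ValueError.
try:
    test_matrix + np.array([1, 2, 3, 4])  # shape (4,) vs. a trailing axis of length 3
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
```

Shapes are compared starting from the last axis, and each pair of lengths must either match or include a 1.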

9. Numpy performance

%timeit : an IPython magic command for measuring code performance in a Jupyter environment

In [591]:

%timeit [2 * value for value in range(1000000) ]

1 loop, best of 3: 216 ms per loop

In [592]:

%timeit np.arange(1000000) * 2

100 loops, best of 3: 5.9 ms per loop

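In general, speed ranks as for loop < list comprehension < numpy (numpy is the fastest)
Numpy is implemented in C, so it gains performance at the cost of dynamic typing, one of Python's defining features
It is the most commonly used tool for large-scale computation
For operations such as concatenate, which allocate a new array and copy data into it rather than compute, there is no speed advantage (this is something to think about; I do not fully understand it yet;;)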
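Outside Jupyter, the same comparison can be sketched with the standard library's timeit module. The function names and the smaller n are my own choices, and the timings will vary by machine, so no numbers are claimed here.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the 1,000,000 used above so the sketch runs quickly

def with_loop():
    result = []
    for value in range(n):
        result.append(2 * value)
    return result

def with_comprehension():
    return [2 * value for value in range(n)]

def with_numpy():
    return np.arange(n) * 2

for func in (with_loop, with_comprehension, with_numpy):
    seconds = timeit.timeit(func, number=10)
    print(f"{func.__name__}: {seconds / 10 * 1000:.2f} ms per call")
```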

10. comparisons

All & Any : return whether all (and) or at least one (or) of the array's elements satisfy a condition

In [593]:

a = np.arange(10)
a

Out[593]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [594]:

np.any(a > 5) # any -> True if at least one element satisfies the condition

Out[594]:

True

In [595]:

np.all(a > 5) # all -> True only if every element satisfies the condition

Out[595]:

False

When two arrays have the same shape, numpy compares them element by element and returns the result as a Boolean array

In [596]:

test_a = np.array([1,3,0], float)
test_b = np.array([5,2,1], float)
test_a > test_b

Out[596]:

array([False,  True, False], dtype=bool)

In [597]:

test_a == test_b

Out[597]:

array([False, False, False], dtype=bool)

In [598]:

(test_a > test_b).any()

Out[598]:

True

logical operations

In [599]:

a = np.array([1,3,0], float)
np.logical_and(a > 0 , a <3 ) # True where both conditions hold (and)

Out[599]:

array([ True, False, False], dtype=bool)

In [600]:

b = np.array([True, False, True], bool)
np.logical_not(b)

Out[600]:

array([False,  True, False], dtype=bool)

In [601]:

c = np.array([False, True, True], bool)
np.logical_or(b,c)

Out[601]:

array([ True,  True,  True], dtype=bool)

np.where (important)

Returns 3 where the condition a>0 holds, otherwise 2

In [602]:

a = np.array([1,3,0], float)
np.where(a>0, 3, 2)

Out[602]:

array([3, 3, 2])

With only a condition, it returns the indices where the condition holds

In [603]:

a = np.arange(10,20,1)
a

Out[603]:

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [604]:

np.where(a > 15)

Out[604]:

(array([6, 7, 8, 9], dtype=int64),)
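The two forms of np.where can be put side by side. In this minimal sketch (variable names are mine), the three-argument form picks values element by element, and the one-argument form returns a tuple of index arrays that can be used to pull out the matching values.

```python
import numpy as np

a = np.arange(10, 20)

# Three-argument form: keep a where a <= 15, otherwise use 15.
capped = np.where(a > 15, 15, a)

# One-argument form: a tuple with one index array per dimension.
(indices,) = np.where(a > 15)
selected = a[indices]  # same values as a[a > 15]
print(capped)
print(indices, selected)
```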

np.isnan() : Not a Number

In [605]:

a = np.array([1, np.nan, np.inf], float) # np.inf: the np.Inf alias was removed in NumPy 2.0
np.isnan(a)

Out[605]:

array([False,  True, False], dtype=bool)

np.isfinite() : is finite number

In [606]:

a = np.array([1, np.nan, np.inf], float) # np.inf: the np.Inf alias was removed in NumPy 2.0
np.isfinite(a)

Out[606]:

array([ True, False, False], dtype=bool)
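These checks are typically used as masks for cleaning data. A minimal sketch under that assumption, with an extra element added to the array for illustration:

```python
import numpy as np

a = np.array([1, np.nan, np.inf, 4], float)

finite_only = a[np.isfinite(a)]               # drops both nan and inf
nan_replaced = np.where(np.isnan(a), 0.0, a)  # nan -> 0, inf is kept

# nan never compares equal to anything, including itself,
# which is why np.isnan is needed instead of ==.
nan_equals_itself = bool(np.nan == np.nan)
print(finite_only, nan_replaced, nan_equals_itself)
```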

argmax & argmin (frequently used)

Return the index of the maximum or minimum value in an array

In [607]:

a = np.array([1,2,3,4,5,6,7])
a

Out[607]:

array([1, 2, 3, 4, 5, 6, 7])

In [608]:

np.argmax(a), np.argmin(a)

Out[608]:

(6, 0)

Results along a given axis

In [609]:

a = np.array([ [1,2,4,7], [9,88,6,45], [9,76,3,4]])
a

Out[609]:

array([[ 1,  2,  4,  7],
       [ 9, 88,  6, 45],
       [ 9, 76,  3,  4]])

In [610]:

np.argmax(a, axis=1), np.argmin(a, axis=0)

Out[610]:

(array([3, 1, 1], dtype=int64), array([0, 0, 2, 2], dtype=int64))
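Without an axis argument, argmax works on the flattened array. The sketch below converts that flat position back into a (row, column) pair with np.unravel_index, which the original notes do not cover.

```python
import numpy as np

a = np.array([[1, 2, 4, 7], [9, 88, 6, 45], [9, 76, 3, 4]])

flat_index = np.argmax(a)                         # position in the flattened array
row, col = np.unravel_index(flat_index, a.shape)  # back to (row, column)
print(flat_index, row, col, a[row, col])
```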

11. boolean index

A numpy array can extract the values that satisfy a condition as a new array
All the comparison operation functions can be used as well

In [611]:

test_array = np.array([1, 4, 0, 2, 3, 8, 9, 7], float)
test_array > 3

Out[611]:

array([False,  True, False, False, False,  True,  True,  True], dtype=bool)

In [612]:

test_array[test_array > 3]

Out[612]:

array([ 4.,  8.,  9.,  7.])

In [613]:

condition = test_array < 3
test_array[condition]

Out[613]:

array([ 1.,  0.,  2.])
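Conditions can also be combined, and a mask can be used on the left-hand side of an assignment. A minimal sketch with variable names of my own:

```python
import numpy as np

test_array = np.array([1, 4, 0, 2, 3, 8, 9, 7], float)

# Combine conditions with & (and) or | (or); each side needs parentheses.
between = test_array[(test_array > 2) & (test_array < 9)]

# A mask on the left-hand side updates only the matching elements.
clipped = test_array.copy()
clipped[clipped > 3] = 3
print(between, clipped)
```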

boolean index

In [614]:

A = np.array( [ 
     [11, 12, 13, 14, 15, 90, 1, 2, 3, 77, 5],
     [10, 12, 9, 14, 15, 0, 1, 98, 3, 4, 55],
     [13, 9, 8, 14, 12, 0, 11, 2, 43, 4, 5],
     [11, 12, 13, 14, 15, 6, 1, 9, 3, 77, 5],
  ])

In [615]:

B = A < 15
B

Out[615]:

array([[ True,  True,  True,  True, False, False,  True,  True,  True,
        False,  True],
       [ True,  True,  True,  True, False,  True,  True, False,  True,
         True, False],
       [ True,  True,  True,  True,  True,  True,  True,  True, False,
         True,  True],
       [ True,  True,  True,  True, False,  True,  True,  True,  True,
        False,  True]], dtype=bool)

In [616]:

B.astype(int) # np.int was removed in NumPy 1.24; the built-in int works the same way

Out[616]:

array([[1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1],
       [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
       [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]])

fancy index

numpy can use an array as index values to extract elements. The index array must be declared as an integer type

In [617]:

a = np.array([2, 4, 6, 8], float)
b = np.array([0, 0, 1, 3, 2, 1], int) # must be declared as integers, otherwise an error is raised
a[b]

Out[617]:

array([ 2.,  2.,  4.,  8.,  6.,  4.])

Often used in the form of take

In [618]:

a.take(b) # take function: same effect as bracket indexing

Out[618]:

array([ 2.,  2.,  4.,  8.,  6.,  4.])

Also works on matrix-shaped data

In [619]:

a = np.array([ [1,4], [9,16]], float)
b = np.array( [0, 0, 1, 1, 0], int)
c = np.array( [0, 1, 1, 1, 1], int)
a[b,c] # b supplies the row indices and c the column indices

Out[619]:

array([  1.,   4.,  16.,  16.,   4.])
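Put differently, a[b, c] looks up the pairs (b[i], c[i]) one by one. The sketch below spells that out with an explicit loop, which is my own illustration of the behavior rather than how NumPy implements it.

```python
import numpy as np

a = np.array([[1, 4], [9, 16]], float)
b = np.array([0, 0, 1, 1, 0], int)  # row indices
c = np.array([0, 1, 1, 1, 1], int)  # column indices

fancy = a[b, c]
# Equivalent element-by-element lookup of the pairs (b[i], c[i]).
paired = np.array([a[i, j] for i, j in zip(b, c)])

# With a single index array, whole rows are selected instead.
rows = a[b]
print(fancy, rows.shape)
```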

12. numpy data i/o

loadtxt & savetxt : read and save data in text format

In [620]:

a = np.loadtxt("./populations.txt")
a

Out[620]:

array([[  1900.,  30000.,   4000.,  48300.],
       [  1901.,  47200.,   6100.,  48200.],
       [  1902.,  70200.,   9800.,  41500.],
       [  1903.,  77400.,  35200.,  38200.],
       [  1904.,  36300.,  59400.,  40600.],
       [  1905.,  20600.,  41700.,  39800.],
       [  1906.,  18100.,  19000.,  38600.],
       [  1907.,  21400.,  13000.,  42300.],
       [  1908.,  22000.,   8300.,  44500.],
       [  1909.,  25400.,   9100.,  42100.],
       [  1910.,  27100.,   7400.,  46000.],
       [  1911.,  40300.,   8000.,  46800.],
       [  1912.,  57000.,  12300.,  43800.],
       [  1913.,  76600.,  19500.,  40900.],
       [  1914.,  52300.,  45700.,  39400.],
       [  1915.,  19500.,  51100.,  39000.],
       [  1916.,  11200.,  29700.,  36700.],
       [  1917.,   7600.,  15800.,  41800.],
       [  1918.,  14600.,   9700.,  43300.],
       [  1919.,  16200.,  10100.,  41300.],
       [  1920.,  24700.,   8600.,  47300.]])

In [621]:

a_int = a.astype(int)
a_int[:3]

Out[621]:

array([[ 1900, 30000,  4000, 48300],
       [ 1901, 47200,  6100, 48200],
       [ 1902, 70200,  9800, 41500]])

In [622]:

np.savetxt('int_data.csv', a_int, delimiter=',')

numpy object (npy)
- Saves and loads data in numpy's own .npy format (pickle is only involved for object arrays)
- The data is stored as a binary file

In [623]:

np.save("npy_test", arr = a_int)

In [624]:

npy_array = np.load(file = "npy_test.npy")
npy_array[:3]

Out[624]:

array([[ 1900, 30000,  4000, 48300],
       [ 1901, 47200,  6100, 48200],
       [ 1902, 70200,  9800, 41500]])
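The two formats can be compared with a round trip. Since populations.txt is not included here, the sketch below uses two hypothetical sample rows and writes to a temporary directory.

```python
import os
import tempfile
import numpy as np

# Hypothetical sample rows standing in for populations.txt.
a_int = np.array([[1900, 30000, 4000, 48300],
                  [1901, 47200, 6100, 48200]])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "int_data.csv")
    npy_path = os.path.join(tmp, "npy_test.npy")

    # fmt="%d" writes plain integers instead of the default scientific notation.
    np.savetxt(csv_path, a_int, delimiter=",", fmt="%d")
    from_csv = np.loadtxt(csv_path, delimiter=",", dtype=int)

    # .npy stores dtype and shape, so no extra arguments are needed on load.
    np.save(npy_path, a_int)
    from_npy = np.load(npy_path)

print(from_csv.dtype, from_npy.dtype)
```

The text format needs the delimiter and dtype on the way back in, while the .npy file restores the array as it was saved.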

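13. Reference videos

Numpy.ipynb

TF-KR first meetup: Zen of NumPy
(After watching the video, I felt it captures the core of NumPy extremely well. It turns out the speaker was one of the translators of 밑바닥부터 시작하는 데이터 과학 (Data Science from Scratch) ^0^ )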
