pandas¶

https://pandas.pydata.org/

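The age of big data

pandas is a tool optimized for collecting and organizing data,
in support of the analysis process that extracts useful information from that data.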

pandas data structures¶

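Data collected for analysis from different sources varies widely in form and attributes.
Data of different kinds and formats has to be unified into a structure with one common format so that the computer can process it.
pandas provides two structured data types for this: Series and DataFrame.
Their purpose is to organize many different types of data into a common format.
DataFrame: a two-dimensional structure made of rows and columns, used frequently in practical data analysis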
```
          (several Series combined)
```
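The idea that a DataFrame is several Series combined can be sketched as follows. This is only an illustration: the names `age`, `city`, and `people` and all of the values are made up for the example and are not part of the original notebook.

```python
import pandas as pd

# Two Series sharing the same index labels (illustrative values).
age = pd.Series([22, 35, 58], index=["kim", "lee", "park"], name="age")
city = pd.Series(["Seoul", "Busan", "Daegu"], index=["kim", "lee", "park"], name="city")

# Concatenating along axis=1 places each Series side by side as a column.
people = pd.concat([age, city], axis=1)
print(people)

# Selecting one column gives back a Series.
print(type(people["age"]))
```

Each Series keeps its own dtype inside the DataFrame, which is why one table can mix numbers and strings.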

1. Series¶

A one-dimensional array in which the data is laid out sequentially
Each index label corresponds one-to-one to a data value
The structure is similar to a Python dictionary

Dictionary -> Series¶

pandas.Series(dictionary)

import pandas as pd

dict_data = {'a':1,'b':2,'c':3}
sr = pd.Series(dict_data)
print(type(sr))
print()
print(sr)

<class 'pandas.core.series.Series'>

a    1
b    2
c    3
dtype: int64

obj = pd.Series([4,7,-5,3])  # without an explicit index, the default index 0, 1, 2, ... is used
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64

Series index / values¶

Series_object.index : the array of index labels
Series_object.values : the array of data values

print(obj.values)
print(obj.index)

[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)

import pandas as pd
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
print(obj2)
print(obj2.index)

d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')

import numpy as np
import pandas as pd
list_A = np.array(list('abcdef'))
list_B = np.arange(10,70,10)
dict_data = {key: value for key, value in zip(list_A,list_B)}
print(dict_data)
sr = pd.Series(dict_data)
print(sr)

{'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50, 'f': 60}
a    10
b    20
c    30
d    40
e    50
f    60
dtype: int64

import numpy as np
import pandas as pd
list_A = np.array(list('abcdef'))
list_B = np.arange(10,70,10)
sr = pd.Series(list_B, index = list_A)
print(sr)

for i in range(sr.size):
    key = sr.index[i]
    print("sr['{}'] : {} or sr[{}] : {}".format(key,sr[key],i, sr.values[i]))

a    10
b    20
c    30
d    40
e    50
f    60
dtype: int32
sr['a'] : 10 or sr[0] : 10
sr['b'] : 20 or sr[1] : 20
sr['c'] : 30 or sr[2] : 30
sr['d'] : 40 or sr[3] : 40
sr['e'] : 50 or sr[4] : 50
sr['f'] : 60 or sr[5] : 60

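print(sr['a'], sr.iloc[0], sr.values[0])  # sr.iloc[0] selects by position; plain sr[0] on a labeled index is deprecated in recent pandas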
print(sr.index[0])

10 10 10
a

print(obj2) ; print()
print(obj2[obj2>0])

d    4
b    7
a   -5
c    3
dtype: int64

d    4
b    7
c    3
dtype: int64

print(obj2*2)

d     8
b    14
a   -10
c     6
dtype: int64

print(np.exp(obj2))

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

print('b' in obj2)
print('e' in obj2)

True
False

sdata = {'Ohio' : 35000, 'Taxas' : 71000, 'Oregon': 16000 ,'Utah' : 5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Taxas     71000
Oregon    16000
Utah       5000
dtype: int64

states = ['California','Ohio', 'Taxas', 'Oregon']   # list -> keeps the order given
obj4 = pd.Series(sdata, index=states)               # a set would have no guaranteed order
print(obj4)

California        NaN
Ohio          35000.0
Taxas         71000.0
Oregon        16000.0
dtype: float64

import pandas as pd
print(pd.isnull(obj4))
print(pd.notnull(obj4))

California     True
Ohio          False
Taxas         False
Oregon        False
dtype: bool
California    False
Ohio           True
Taxas          True
Oregon         True
dtype: bool

print(obj4.isnull())

California     True
Ohio          False
Taxas         False
Oregon        False
dtype: bool

print(obj3); print()
print(obj4); print() 
print(obj3+obj4)

Ohio      35000
Taxas     71000
Oregon    16000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Taxas         71000.0
Oregon        16000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         32000.0
Taxas         142000.0
Utah               NaN
dtype: float64
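The result above shows that `+` aligns two Series on their index labels and yields NaN wherever a label exists on only one side. As a small sketch of one way to avoid those NaN values, `Series.add` accepts a `fill_value`. The Series `s1` and `s2` and the California figure below are invented for illustration.

```python
import pandas as pd

s1 = pd.Series({"Ohio": 35000, "Oregon": 16000, "Utah": 5000})
s2 = pd.Series({"California": 1000, "Ohio": 35000, "Oregon": 16000})  # illustrative values

# Plain + aligns on labels; labels missing on either side become NaN.
plain = s1 + s2

# add(..., fill_value=0) treats the missing side as 0 instead.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

Whether 0 is a sensible substitute depends on the data; for population counts it silently treats "unknown" as "none", so use it with care.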

#print(obj4.name)
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)

state
California        NaN
Ohio          35000.0
Taxas         71000.0
Oregon        16000.0
Name: population, dtype: float64

print(obj)

0    4
1    7
2   -5
3    3
dtype: int64

obj.index = ['Bob','Steve','Jeff','Ryan']
print(obj)

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

2. DataFrame¶

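A two-dimensional array
Derived from the data frame in R
The same tabular layout is used in Excel, relational databases, and similar tools
Each column is a Series object

Setting the row index and column names:
pandas.DataFrame(2-D array, index=array of row labels, columns=array of column names)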
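A minimal sketch of that constructor form, assuming a small NumPy array as the 2-D input. The variable names and the labels `r1`, `r2`, `a`, `b`, `c` are placeholders chosen for the example.

```python
import numpy as np
import pandas as pd

# A 2 x 3 array: two rows, three columns.
values = np.arange(6).reshape(2, 3)

# index labels the rows, columns labels the columns; lengths must match the array shape.
df_example = pd.DataFrame(values, index=["r1", "r2"], columns=["a", "b", "c"])

print(df_example)
print(df_example.index.tolist(), df_example.columns.tolist())
```

If `index` or `columns` is omitted, pandas falls back to the default integer labels 0, 1, 2, ...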

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

frame

# Change the order of the columns
pd.DataFrame(data,columns=['year','state','pop'])

frame2 =pd.DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five','six'])
frame2

Changing row index labels : DataFrame_object.rename(index={old label: new label, ...})
Changing column names : DataFrame_object.rename(columns={old name: new name, ...})
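One detail worth seeing in isolation: without `inplace=True`, `rename` returns a new DataFrame and leaves the original untouched. The sketch below uses a throwaway frame (`df_example`, not one of the notebook's variables) to show this.

```python
import pandas as pd

df_example = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]},
                          index=["one", "two"])

# rename returns a modified copy; labels not listed in the mapping stay as they are.
renamed = df_example.rename(columns={"year": "YEAR"}, index={"one": "01"})

print(renamed)
print(df_example.columns.tolist())  # the original column names are unchanged
```

Assigning the result to a new name, as here, is often easier to follow than `inplace=True` because the before and after states both remain available.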

frame2.rename(columns={'year':'YEA','state':'STA','pop':'POP','debt':'DEB'},inplace = True)
frame2.head()
frame2.rename(index={'one':'01','two':'02','three':'03','four':'04'},inplace = True)
frame2.head()

frame2['STA']

01        Ohio
02        Ohio
03        Ohio
04      Nevada
five    Nevada
six     Nevada
Name: STA, dtype: object

frame.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

frame2.loc['03']

YEA    2002
STA    Ohio
POP     3.6
DEB     NaN
Name: 03, dtype: object

frame2.iloc[2]

YEA    2002
STA    Ohio
POP     3.6
DEB     NaN
Name: 03, dtype: object

frame2['DEB'] = 16.5
frame2

frame2['DEB'] = np.arange(1,13,2)
frame2

val = pd.Series([-1.2,-1.5,-1.7],index = ['02','four','six'])
frame2['DEB'] = val
frame2

frame2['eastern'] = frame2.STA == 'Ohio'
frame2

frame2['Big_state'] = (frame2.STA == 'Ohio') & (frame2.POP > 3.0)
frame2

del frame2['eastern']
frame2.head()

del frame2['Big_state']
frame2.head()

Nested dictionaries¶

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3

frame3.T

pd.DataFrame(pop,index=[2001,2002,2003])

print(frame3.iloc[0,0])
print(frame3.iloc[0,1])
print(frame3.iloc[1,0])
print(frame3.iloc[1,1])

2.4
1.7
2.9
3.6

frame3.iloc[0,[0,1]]

Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64

frame3.iloc[0,0:]

Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64

pdata = {'Ohio' : frame3['Ohio'][:-1],'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

titanic.head()

titanic.tail()

df = titanic.loc[:,['age','fare']]

df.head()

df.tail()

df_add10 =df+10
df_add10.head()

print(type(df_add10))

<class 'pandas.core.frame.DataFrame'>

df_sub = df_add10 - df
df_sub.head()

obj = pd.Series(range(3),index=['a','b','c'])
index = obj.index
print(index)
index[1:]

Index(['a', 'b', 'c'], dtype='object')

Index(['b', 'c'], dtype='object')

import numpy as np
labels = pd.Index(np.arange(3))
print(labels)
obj2 = pd.Series([1.5,-2.5,0], index = labels)
obj2

Int64Index([0, 1, 2], dtype='int64')

0    1.5
1   -2.5
2    0.0
dtype: float64

obj2.index is labels

True

dup_labels = pd.Index(['foo','foo','bar','bar'])    # an Index may contain duplicate labels
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

obj = pd.Series([4.5,7.2,-5.3,3.6], index = ['d','b','a','c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['d','b','a','c','e'])
obj2

d    4.5
b    7.2
a   -5.3
c    3.6
e    NaN
dtype: float64

obj3 = pd.Series(['blue','purple','yellow'], index =[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

obj3.reindex(range(6),method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

import numpy as np
import pandas as pd
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'], columns = ['Ohio','Texas','California'])
frame

frame2 = frame.reindex(['a','b','c'])
frame2

states = ['Texas','Utah','California']
frame.reindex(columns=states)

obj = pd.Series(np.arange(5.), index = ['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

new_obj2 = obj.drop(['c','d'])
new_obj2

a    0.0
b    1.0
e    4.0
dtype: float64

data = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data

# drop removes rows by default; to remove columns, pass axis=1 (or axis='columns')
data.drop(['Colorado','Ohio'])

data2 = data.drop('two',axis=1)
data2.drop('Utah',axis=0)

data.drop(['two','four'],axis=1)

data.drop('two', axis=1)

data.drop(['Ohio'],axis='rows') # axis='rows' (same as axis=0) is the default and can be omitted

data.drop('Ohio')
data

data3 = data.copy()
data3.drop('Ohio',inplace=True)
data3
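As an alternative to remembering which `axis` value means what, `drop` also accepts `index=` and `columns=` keywords. The sketch below uses a small stand-in frame named `table` (hypothetical, not from the notebook) and also checks that `drop` does not modify the original unless `inplace=True` is given.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=["r1", "r2"], columns=["a", "b", "c"])

# Equivalent to table.drop("r1", axis=0)
no_row = table.drop(index="r1")

# Equivalent to table.drop(["a", "c"], axis=1)
no_col = table.drop(columns=["a", "c"])

print(no_row)
print(no_col)
print(table.shape)  # the original still has 2 rows and 3 columns
```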

Indexing¶

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

print(obj['b'], obj.iloc[1]) ; print()      # label vs. integer position
print(obj[2:4]) ; print()
print(obj[['b','a','d']]) ; print()
print(obj.iloc[[1,3]]) ; print()            # positional selection goes through iloc
print(obj<2) ; print()

1.0 1.0

c    2.0
d    3.0
dtype: float64

b    1.0
a    0.0
d    3.0
dtype: float64

b    1.0
d    3.0
dtype: float64

a     True
b     True
c    False
d    False
dtype: bool

obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

data[['three','one']]

data[:2]

data[data['three']>5]

data[data<5] = 0
data

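Difference between iloc and loc

iloc selects by integer position (0-based row and column numbers)
loc selects by label (row index labels and column names)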
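The difference is easiest to see with slices: a label slice through `loc` includes its end label, while a positional slice through `iloc` excludes its end position, like ordinary Python slicing. The Series `s` and `t` below are small illustrative examples, not variables from the notebook.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]    # label slice: 'c' is included
by_position = s.iloc[0:2]    # positional slice: position 2 is excluded

print(by_label)
print(by_position)

# With integer labels the two can point at different elements.
t = pd.Series([1, 2, 3], index=[10, 20, 30])
print(t.loc[10], t.iloc[0])  # label 10 and position 0 are the same element here
```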

data.loc['Colorado',['two','three']]

two      5
three    6
Name: Colorado, dtype: int32

# column 'two', rows up to and including 'Utah'
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

data.iloc[2,[3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

data.iloc[[1,2],[3,0,1]]

data

data.iloc[:,:3][data.three>5]

ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

print(ser[:1])
print(ser.loc[:1])
print(ser.iloc[:1])

0    0.0
dtype: float64
0    0.0
1    1.0
dtype: float64
0    0.0
dtype: float64

np.random.randn : generates a matrix of random numbers drawn from a Gaussian (normal) distribution with mean 0 and standard deviation 1
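A quick way to convince yourself of the "mean 0, standard deviation 1" claim is to draw a large sample and look at its statistics. The seed value and the sample size below are arbitrary choices for this sketch.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

# 100,000 rows x 3 columns of standard-normal draws.
samples = np.random.randn(100_000, 3)

# Both numbers are only approximately 0 and 1 because the sample is finite.
print(samples.shape)
print(samples.mean(), samples.std())
```

Because the values are random, the frames built from `np.random.randn` in this post will show different numbers on every run unless a seed is set first.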

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

np.abs(frame)

f = lambda x:x.max()-x.min()

frame.apply(f)

b    2.119906
d    2.090682
e    3.890894
dtype: float64

frame.apply(f, axis='columns')

Utah      0.825604
Ohio      4.022541
Texas     2.293994
Oregon    3.501372
dtype: float64

def f(x):
    return pd.Series([x.min(),x.max()], index=['min','max'])
frame.apply(f)

Sorting and Ranking¶

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

# sort by index label
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

# sort the rows by index label in ascending order
frame.sort_index()

# sort the columns by name in ascending order
frame.sort_index(axis=1)

# sort the columns by name in descending order
frame.sort_index(axis=1,ascending=False)

obj = pd.Series([4, 7, -3, 2])

# sort by value, smallest first
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

# sort by column b, smallest first
frame.sort_values(by='b')

# sort by column a first, then by column b within equal values of a
frame.sort_values(by=['a','b'])

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj)

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

# ascending=False -> rank in descending order (the largest value gets rank 1)
# method='max' : tied values all receive the highest rank within their group
obj.rank(ascending= False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
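To compare how the tie-breaking methods differ, it can help to put them side by side in one table. The sketch below reuses the same values as the Series above under a new name (`obj_example`) and adds `'min'` and `'dense'`, two other `method` options that `rank` accepts.

```python
import pandas as pd

obj_example = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, all ranking in ascending order.
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: obj_example.rank(method=m) for m in methods})

print(ranks)
```

The two 7s occupy positions 6 and 7 in sorted order, so they get 6.5 with `average`, 6 with `min`, 7 with `max`, 6 and 7 in order of appearance with `first`, and 5 with `dense` (which leaves no gaps between ranks).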

frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

# rank the values across the columns within each row
frame.rank(axis=1)

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

obj.index.is_unique

False

obj['a']

a    0
a    1
dtype: int64

obj['c']

4

df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

df.loc['b']

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

df.sum()

one    9.25
two   -5.80
dtype: float64

df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

# skipna=True (the default) excludes NaN from the calculation; skipna=False makes any row containing NaN yield NaN
df.mean(axis='columns',skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
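The effect of `skipna` can be sketched on a tiny frame where one row is partly missing and one row is entirely missing. The frame `df_example` and its values are made up for this illustration; `min_count` is an additional `sum` parameter shown here because the all-NaN row summing to 0.0 above can be surprising.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({"one": [1.0, 2.0, np.nan],
                           "two": [np.nan, 4.0, np.nan]},
                          index=["a", "b", "c"])

mean_skip = df_example.mean(axis="columns")                  # NaN ignored (default)
mean_strict = df_example.mean(axis="columns", skipna=False)  # any NaN -> NaN

total = df_example.sum(axis="columns")                       # all-NaN row sums to 0.0
total_strict = df_example.sum(axis="columns", min_count=1)   # require at least 1 valid value

print(mean_skip, mean_strict, total, total_strict, sep="\n\n")
```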

# idxmax returns the index label of the maximum value in each column (idxmin does the same for the minimum)
df.idxmax()

one    b
two    d
dtype: object

# cumulative sum
df.cumsum()

df.describe()

Unique Values, Value Counts, and Membership¶

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

pd.value_counts(obj.values,sort=False)

a    3
d    1
b    2
c    3
dtype: int64

obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

# check whether each element is 'b' or 'c'
mask = obj.isin(['b','c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
to_match

0    c
1    a
2    b
3    b
4    c
5    a
dtype: object

unique_vals = pd.Series(['c', 'b', 'a'])
unique_vals

0    c
1    b
2    a
dtype: object

pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

data

# count how many times each value occurs in each column
result = data.apply(pd.value_counts).fillna(0)
result

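isin : returns a boolean array indicating whether each Series element is contained in the passed sequence of values
match : computes, for each value, its integer position in an array of distinct values (done in this post with pd.Index(...).get_indexer)
unique : returns an array of the distinct values in a Series, with duplicates removed
value_counts : returns a Series whose index holds the distinct values and whose values are their frequencies (sorted by frequency in descending order)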
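The four operations listed above can be tried together on one small Series. This is a compact sketch; the name `letters` and the extra lookup value `'z'` are additions for illustration, the latter included to show what `get_indexer` returns for a value that is not found.

```python
import pandas as pd

letters = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

mask = letters.isin(["b", "c"])   # True where the element is 'b' or 'c'
uniques = letters.unique()        # distinct values in order of first appearance
counts = letters.value_counts()   # frequencies, most frequent first

# Position of each lookup value inside the distinct-value Index; -1 means "not found".
positions = pd.Index(["c", "b", "a"]).get_indexer(["c", "a", "b", "z"])

print(mask.sum())
print(uniques)
print(counts)
print(positions)
```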

data['Qu1'].value_counts()[:2]

4    2
3    2
Name: Qu1, dtype: int64

from IPython.display import display, HTML  # IPython.core.display is the older, deprecated import path
display(HTML("<style>.container {width:90% !important; }</style>"))

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
886	0	2	male	27.0	0	0	13.00	S	Second	man	True	NaN	Southampton	no	True
887	1	1	female	19.0	0	0	30.00	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.45	S	Third	woman	False	NaN	Southampton	no	False
889	1	1	male	26.0	0	0	30.00	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.75	Q	Third	man	True	NaN	Queenstown	no	True

	age	fare
0	22.0	7.2500
1	38.0	71.2833
2	26.0	7.9250
3	35.0	53.1000
4	35.0	8.0500

	age	fare
886	27.0	13.00
887	19.0	30.00
888	NaN	23.45
889	26.0	30.00
890	32.0	7.75

	age	fare
0	32.0	17.2500
1	48.0	81.2833
2	36.0	17.9250
3	45.0	63.1000
4	45.0	18.0500

Ji Hee's cording

08.Python Pandas


	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

	YEA	STA	POP	DEB
01	2000	Ohio	1.5	16.5
02	2001	Ohio	1.7	16.5
03	2002	Ohio	3.6	16.5
04	2001	Nevada	2.4	16.5
five	2002	Nevada	2.9	16.5
six	2003	Nevada	3.2	16.5

	b	d	e
Utah	-0.729436	0.038952	-0.786652
Ohio	0.994167	0.910993	-3.028374
Texas	1.390470	-0.903524	0.862521
Oregon	1.288409	-1.179690	-2.212963

	b	d	e
Utah	0.729436	0.038952	0.786652
Ohio	0.994167	0.910993	3.028374
Texas	1.390470	0.903524	0.862521
Oregon	1.288409	1.179690	2.212963

	0	1	2
a	0.952780	-0.107355	-2.503290
a	1.798242	0.003292	-0.084674
b	0.242779	-0.224056	-1.409196
b	0.358826	1.316739	0.834168

	one	two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	1.075000	-3.700000
50%	1.400000	-2.900000
75%	4.250000	-2.100000
max	7.100000	-1.300000


	age	fare
0	10.0	10.0
1	10.0	10.0
2	10.0	10.0
3	10.0	10.0
4	10.0	10.0
