08.Python Pandas

by owo_v 2020. 9. 9.

pandas

https://pandas.pydata.org/

The big data era

  • Analysis is the process of extracting useful information from data
  • pandas is a tool optimized for collecting and organizing data for that process
 

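pandas data structures

  • Data collected from different sources for analysis varies widely in form and attributes
  • Data of different types and formats must be unified into a structure with a common format so that a computer can process it
  • pandas provides two structured data types: Series and DataFrame
  • Their purpose is to organize many different kinds of data into one common format
  • DataFrame : a two-dimensional structure of rows and columns, frequently used in practical data analysis
               (several Series combined)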
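To make the "several Series combined" idea concrete, here is a minimal sketch. The names `age` and `fare` and their values are made up for illustration, and `pd.concat` is just one of several ways to combine Series into a DataFrame.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values)
age = pd.Series([22, 38, 26], index=['a', 'b', 'c'], name='age')
fare = pd.Series([7.25, 71.28, 7.93], index=['a', 'b', 'c'], name='fare')

# Placing them side by side (axis=1) yields a DataFrame with one column per Series
df = pd.concat([age, fare], axis=1)
print(df)

# Selecting a single column gives a Series back
col = df['age']
print(type(col))
```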
 

1. Series

  • A one-dimensional array in which the data is laid out sequentially
  • Each index label corresponds one-to-one with a data value
  • The structure is similar to a Python dictionary
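The dictionary comparison can be sketched as follows. The `prices` data is invented for illustration. The point is that a Series supports dictionary-style lookups and also array-style arithmetic, which a plain dict does not.

```python
import pandas as pd

prices = {'apple': 3, 'banana': 1, 'cherry': 7}   # illustrative data
sr = pd.Series(prices)

# Dictionary-style operations
has_apple = 'apple' in sr        # membership test checks the index labels
apple_price = sr['apple']        # lookup by label
back_to_dict = sr.to_dict()      # convert back to a plain dict

# Array-style operation that a dict does not offer
doubled = sr * 2

print(has_apple, apple_price)
print(back_to_dict)
print(doubled)
```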
 

Dictionary -> Series

pandas.Series(dictionary)

In [3]:
import pandas as pd
In [4]:
dict_data = {'a':1,'b':2,'c':3}
sr = pd.Series(dict_data)
print(type(sr))
print()
print(sr)
 
<class 'pandas.core.series.Series'>

a    1
b    2
c    3
dtype: int64
In [5]:
obj = pd.Series([4,7,-5,3]) # If no index is given, a default integer index starting at 0 is assigned
print(obj)
 
0    4
1    7
2   -5
3    3
dtype: int64
 

Series index / values

  • Series_object.index : the array of index labels
  • Series_object.values : the array of data values
In [6]:
print(obj.values)
print(obj.index)
 
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)
In [7]:
import pandas as pd
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
print(obj2)
print(obj2.index)
 
d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')
In [8]:
import numpy as np
import pandas as pd
list_A = np.array(list('abcdef'))
list_B = np.arange(10,70,10)
dict_data = {key: value for key, value in zip(list_A,list_B)}
print(dict_data)
sr = pd.Series(dict_data)
print(sr)
 
{'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50, 'f': 60}
a    10
b    20
c    30
d    40
e    50
f    60
dtype: int64
In [9]:
import numpy as np
import pandas as pd
list_A = np.array(list('abcdef'))
list_B = np.arange(10,70,10)
sr = pd.Series(list_B, index = list_A)
print(sr)

for i in range(sr.size):
    key = sr.index[i]
    print("sr['{}'] : {} or sr[{}] : {}".format(key,sr[key],i, sr.values[i]))
 
a    10
b    20
c    30
d    40
e    50
f    60
dtype: int32
sr['a'] : 10 or sr[0] : 10
sr['b'] : 20 or sr[1] : 20
sr['c'] : 30 or sr[2] : 30
sr['d'] : 40 or sr[3] : 40
sr['e'] : 50 or sr[4] : 50
sr['f'] : 60 or sr[5] : 60
In [10]:
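print(sr['a'], sr.iloc[0], sr.values[0])  # sr[0] works in older pandas, but positional access through [] is deprecated in newer versions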
print(sr.index[0])
 
10 10 10
a
In [11]:
print(obj2) ; print()
print(obj2[obj2>0])
 
d    4
b    7
a   -5
c    3
dtype: int64

d    4
b    7
c    3
dtype: int64
In [12]:
print(obj2*2)
 
d     8
b    14
a   -10
c     6
dtype: int64
In [13]:
print(np.exp(obj2))
 
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
In [14]:
print('b' in obj2)
print('e' in obj2)
 
True
False
In [15]:
sdata = {'Ohio' : 35000, 'Taxas' : 71000, 'Oregon': 16000 ,'Utah' : 5000}
obj3 = pd.Series(sdata)
print(obj3)
 
Ohio      35000
Taxas     71000
Oregon    16000
Utah       5000
dtype: int64
In [16]:
states = ['California','Ohio', 'Taxas', 'Oregon']   # a list keeps the order in which the labels are given
obj4 = pd.Series(sdata, index=states)               # a set has no defined order, so use a list for the index
print(obj4)
 
California        NaN
Ohio          35000.0
Taxas         71000.0
Oregon        16000.0
dtype: float64
In [17]:
import pandas as pd
print(pd.isnull(obj4))
print(pd.notnull(obj4))
 
California     True
Ohio          False
Taxas         False
Oregon        False
dtype: bool
California    False
Ohio           True
Taxas          True
Oregon         True
dtype: bool
In [18]:
print(obj4.isnull())
 
California     True
Ohio          False
Taxas         False
Oregon        False
dtype: bool
In [19]:
print(obj3); print()
print(obj4); print() 
print(obj3+obj4)
 
Ohio      35000
Taxas     71000
Oregon    16000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Taxas         71000.0
Oregon        16000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         32000.0
Taxas         142000.0
Utah               NaN
dtype: float64
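The NaN entries in the sum above come from label alignment: a label that is missing on either side has no partner to add to. As a hedged sketch (the two Series below are illustrative, not the ones from the post), `Series.add` with `fill_value` treats a missing side as a given value instead.

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Oregon': 16000, 'Utah': 5000})
b = pd.Series({'California': 1000, 'Ohio': 35000, 'Oregon': 16000})

plain = a + b                      # a label missing on either side gives NaN
filled = a.add(b, fill_value=0)    # treat the missing side as 0 instead

print(plain)
print(filled)
```

Note that `fill_value` only helps when a label is missing on one side. If a value is NaN on both sides, the result is still NaN.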
In [20]:
#print(obj4.name)
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)
 
state
California        NaN
Ohio          35000.0
Taxas         71000.0
Oregon        16000.0
Name: population, dtype: float64
In [21]:
print(obj)
 
0    4
1    7
2   -5
3    3
dtype: int64
In [22]:
obj.index = ['Bob','Steve','Jeff','Ryan']
print(obj)
 
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
 

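2. DataFrame

  • A two-dimensional array
  • Derived from the data frame in R
  • The same tabular layout is used in Excel, relational databases, and similar tools
  • Each column is a Series object
 

Setting the row index / column names :
pandas.DataFrame(2D array, index = array of row index labels, columns = array of column names)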
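A minimal sketch of this constructor form with an actual 2D array. The labels `row1`, `c1`, and so on are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(2, 3)                # 2 rows x 3 columns
df = pd.DataFrame(values,
                  index=['row1', 'row2'],          # row index labels
                  columns=['c1', 'c2', 'c3'])      # column names
print(df)
print(df.index.tolist())
print(df.columns.tolist())
```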

In [23]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
In [24]:
frame
Out[24]:
  state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
In [25]:
# Change the order of the columns
pd.DataFrame(data,columns=['year','state','pop'])
Out[25]:
  year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
In [26]:
frame2 =pd.DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five','six'])
frame2
Out[26]:
  year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
 

Renaming row index labels : DataFrame_object.rename(index={old label:new label, ...})
Renaming columns : DataFrame_object.rename(columns={old name:new name, ...})
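One detail worth sketching: without `inplace=True`, `rename` returns a new DataFrame and leaves the original untouched. The small frame below is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]}, index=['one', 'two'])

# Row labels and column names can be renamed in a single call
renamed = df.rename(index={'one': '01'}, columns={'year': 'YEA'})

print(renamed)
print(df)      # the original keeps its labels
```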

In [27]:
frame2.rename(columns={'year':'YEA','state':'STA','pop':'POP','debt':'DEB'},inplace = True)
frame2.head()
frame2.rename(index={'one':'01','two':'02','three':'03','four':'04'},inplace = True)
frame2.head()
Out[27]:
  YEA STA POP DEB
01 2000 Ohio 1.5 NaN
02 2001 Ohio 1.7 NaN
03 2002 Ohio 3.6 NaN
04 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
In [28]:
frame2['STA']
Out[28]:
01        Ohio
02        Ohio
03        Ohio
04      Nevada
five    Nevada
six     Nevada
Name: STA, dtype: object
In [29]:
frame.year
Out[29]:
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
In [30]:
frame2.loc['03']
Out[30]:
YEA    2002
STA    Ohio
POP     3.6
DEB     NaN
Name: 03, dtype: object
In [31]:
frame2.iloc[2]
Out[31]:
YEA    2002
STA    Ohio
POP     3.6
DEB     NaN
Name: 03, dtype: object
In [32]:
frame2['DEB'] = 16.5
frame2
Out[32]:
  YEA STA POP DEB
01 2000 Ohio 1.5 16.5
02 2001 Ohio 1.7 16.5
03 2002 Ohio 3.6 16.5
04 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
In [33]:
frame2['DEB'] = np.arange(1,13,2)
frame2
Out[33]:
  YEA STA POP DEB
01 2000 Ohio 1.5 1
02 2001 Ohio 1.7 3
03 2002 Ohio 3.6 5
04 2001 Nevada 2.4 7
five 2002 Nevada 2.9 9
six 2003 Nevada 3.2 11
In [34]:
val = pd.Series([-1.2,-1.5,-1.7],index = ['02','four','six'])
frame2['DEB'] = val
frame2
Out[34]:
  YEA STA POP DEB
01 2000 Ohio 1.5 NaN
02 2001 Ohio 1.7 -1.2
03 2002 Ohio 3.6 NaN
04 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 -1.7
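In the output above, -1.5 never appears because its label 'four' no longer exists in the frame (the row was renamed to '04'). Assigning a Series to a column aligns by label, not by position. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['01', '02', 'six'])
val = pd.Series([-1.2, -1.5, -1.7], index=['02', 'four', 'six'])

# Values are matched on index labels; unmatched rows become NaN,
# and labels that exist only in val ('four') are silently dropped
df['debt'] = val
print(df)
```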
In [35]:
frame2['eastern'] = frame2.STA == 'Ohio'
frame2
Out[35]:
  YEA STA POP DEB eastern
01 2000 Ohio 1.5 NaN True
02 2001 Ohio 1.7 -1.2 True
03 2002 Ohio 3.6 NaN True
04 2001 Nevada 2.4 NaN False
five 2002 Nevada 2.9 NaN False
six 2003 Nevada 3.2 -1.7 False
In [36]:
frame2['Big_state'] = (frame2.STA == 'Ohio') & (frame2.POP > 3.0)
frame2
Out[36]:
  YEA STA POP DEB eastern Big_state
01 2000 Ohio 1.5 NaN True False
02 2001 Ohio 1.7 -1.2 True False
03 2002 Ohio 3.6 NaN True True
04 2001 Nevada 2.4 NaN False False
five 2002 Nevada 2.9 NaN False False
six 2003 Nevada 3.2 -1.7 False False
In [37]:
del frame2['eastern']
frame2.head()
Out[37]:
  YEA STA POP DEB Big_state
01 2000 Ohio 1.5 NaN False
02 2001 Ohio 1.7 -1.2 False
03 2002 Ohio 3.6 NaN True
04 2001 Nevada 2.4 NaN False
five 2002 Nevada 2.9 NaN False
In [38]:
del frame2['Big_state']
frame2.head()
Out[38]:
  YEA STA POP DEB
01 2000 Ohio 1.5 NaN
02 2001 Ohio 1.7 -1.2
03 2002 Ohio 3.6 NaN
04 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
 

Nested dictionaries

In [39]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [40]:
frame3 = pd.DataFrame(pop)
frame3
Out[40]:
  Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
In [41]:
frame3.T
Out[41]:
  2001 2002 2000
Nevada 2.4 2.9 NaN
Ohio 1.7 3.6 1.5
In [42]:
pd.DataFrame(pop,index=[2001,2002,2003])
Out[42]:
  Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
In [43]:
print(frame3.iloc[0,0])
print(frame3.iloc[0,1])
print(frame3.iloc[1,0])
print(frame3.iloc[1,1])
 
2.4
1.7
2.9
3.6
In [44]:
frame3.iloc[0,[0,1]]
Out[44]:
Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64
In [45]:
frame3.iloc[0,0:]
Out[45]:
Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64
In [46]:
pdata = {'Ohio' : frame3['Ohio'][:-1],'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
Out[46]:
  Ohio Nevada
2001 1.7 2.4
2002 3.6 2.9
In [47]:
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
In [48]:
titanic.head()
Out[48]:
   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0   7.2500         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0   7.9250         S  Third  woman       False   NaN  Southampton    yes   True
3         1       1  female  35.0      1      0  53.1000         S  First  woman       False     C  Southampton    yes  False
4         0       3    male  35.0      0      0   8.0500         S  Third    man        True   NaN  Southampton     no   True
In [49]:
titanic.tail()
Out[49]:
     survived  pclass     sex   age  sibsp  parch   fare  embarked   class    who  adult_male  deck  embark_town  alive  alone
886         0       2    male  27.0      0      0  13.00         S  Second    man        True   NaN  Southampton     no   True
887         1       1  female  19.0      0      0  30.00         S   First  woman       False     B  Southampton    yes   True
888         0       3  female   NaN      1      2  23.45         S   Third  woman       False   NaN  Southampton     no  False
889         1       1    male  26.0      0      0  30.00         C   First    man        True     C    Cherbourg    yes   True
890         0       3    male  32.0      0      0   7.75         Q   Third    man        True   NaN   Queenstown     no   True
In [50]:
df = titanic.loc[:,['age','fare']]
In [51]:
df.head()
Out[51]:
  age fare
0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
4 35.0 8.0500
In [52]:
df.tail()
Out[52]:
  age fare
886 27.0 13.00
887 19.0 30.00
888 NaN 23.45
889 26.0 30.00
890 32.0 7.75
In [53]:
df_add10 =df+10
df_add10.head()
Out[53]:
  age fare
0 32.0 17.2500
1 48.0 81.2833
2 36.0 17.9250
3 45.0 63.1000
4 45.0 18.0500
In [54]:
print(type(df_add10))
 
<class 'pandas.core.frame.DataFrame'>
In [55]:
df_sub = df_add10 - df
df_sub.head()
Out[55]:
  age fare
0 10.0 10.0
1 10.0 10.0
2 10.0 10.0
3 10.0 10.0
4 10.0 10.0
In [56]:
obj = pd.Series(range(3),index=['a','b','c'])
index = obj.index
print(index)
index[1:]
 
Index(['a', 'b', 'c'], dtype='object')
Out[56]:
Index(['b', 'c'], dtype='object')
In [57]:
import numpy as np
labels = pd.Index(np.arange(3))
print(labels)
obj2 = pd.Series([1.5,-2.5,0], index = labels)
obj2
 
Int64Index([0, 1, 2], dtype='int64')
Out[57]:
0    1.5
1   -2.5
2    0.0
dtype: float64
In [58]:
obj2.index is labels
Out[58]:
True
In [59]:
dup_labels = pd.Index(['foo','foo','bar','bar'])    # index labels are allowed to contain duplicates
dup_labels
Out[59]:
Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
In [60]:
obj = pd.Series([4.5,7.2,-5.3,3.6], index = ['d','b','a','c'])
obj
Out[60]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [61]:
obj2 = obj.reindex(['d','b','a','c','e'])
obj2
Out[61]:
d    4.5
b    7.2
a   -5.3
c    3.6
e    NaN
dtype: float64
In [62]:
obj3 = pd.Series(['blue','purple','yellow'], index =[0,2,4])
obj3
Out[62]:
0      blue
2    purple
4    yellow
dtype: object
In [63]:
obj3.reindex(range(6),method='ffill')
Out[63]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
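Besides forward fill, `reindex` accepts other ways to fill the newly created labels. The sketch below reuses the same three-colour Series. The fill string 'none' is an arbitrary placeholder chosen for illustration.

```python
import pandas as pd

s = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = s.reindex(range(6), method='ffill')       # carry the previous value forward
backward = s.reindex(range(6), method='bfill')      # take the next value instead
constant = s.reindex(range(6), fill_value='none')   # use a fixed value for new labels

print(forward)
print(backward)
print(constant)
```

With `bfill`, label 5 has no later value to borrow from, so it stays NaN.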
In [64]:
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'], columns = ['Ohio','Texas','California'])
frame
Out[64]:
  Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [65]:
frame2 = frame.reindex(['a','b','c'])
frame2
Out[65]:
  Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
In [66]:
states = ['Texas','Utah','California']
frame.reindex(columns=states)
Out[66]:
  Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
In [67]:
obj = pd.Series(np.arange(5.), index = ['a','b','c','d','e'])
obj
Out[67]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
In [68]:
new_obj = obj.drop('c')
new_obj
Out[68]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
In [69]:
new_obj2 = obj.drop(['c','d'])
new_obj2
Out[69]:
a    0.0
b    1.0
e    4.0
dtype: float64
In [70]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data
Out[70]:
  one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [71]:
# drop removes rows by default; to remove columns, pass axis=1 (or axis='columns')
data.drop(['Colorado','Ohio'])
Out[71]:
  one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [72]:
data2 = data.drop('two',axis=1)
data2.drop('Utah',axis=0)
Out[72]:
  one three four
Ohio 0 2 3
Colorado 4 6 7
New York 12 14 15
In [73]:
data.drop(['two','four'],axis=1)
Out[73]:
  one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
In [74]:
data.drop('two', axis=1)
Out[74]:
  one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [75]:
data.drop(['Ohio'],axis='rows') # axis='rows' (equivalently axis=0) is the default, so it can be omitted
Out[75]:
  one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [76]:
data.drop('Ohio')
data
Out[76]:
  one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [77]:
data3 = data.copy()
data3.drop('Ohio',inplace=True)
data3
Out[77]:
  one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
 

Indexing

In [78]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
Out[78]:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
In [79]:
print(obj['b'], obj.iloc[1]) ; print()      # label lookup vs. integer position
print(obj[2:4]) ; print()
print(obj[['b','a','d']]) ; print()
print(obj.iloc[[1,3]]) ; print()            # positional selection through iloc
print(obj<2) ; print()
 
1.0 1.0

c    2.0
d    3.0
dtype: float64

b    1.0
a    0.0
d    3.0
dtype: float64

b    1.0
d    3.0
dtype: float64

a     True
b     True
c    False
d    False
dtype: bool

In [80]:
obj['b':'c'] = 5
obj
Out[80]:
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
In [81]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
Out[81]:
  one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [82]:
data['two']
Out[82]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
In [83]:
data[['three','one']]
Out[83]:
  three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
In [84]:
data[:2]
Out[84]:
  one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [85]:
data[data['three']>5]
Out[85]:
  one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [86]:
data[data<5] = 0
data
Out[86]:
  one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
 

Difference between iloc and loc

  • iloc selects by integer position (row/column number)
  • loc selects by label (row index label and column name)
In [87]:
data.loc['Colorado',['two','three']]
Out[87]:
two      5
three    6
Name: Colorado, dtype: int32
In [88]:
# Column 'two', rows up to and including 'Utah'
data.loc[:'Utah', 'two']
Out[88]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32
In [89]:
data.iloc[2,[3,0,1]]
Out[89]:
four    11
one      8
two      9
Name: Utah, dtype: int32
In [90]:
data.iloc[[1,2],[3,0,1]]
Out[90]:
  four one two
Colorado 7 0 5
Utah 11 8 9
In [91]:
data
Out[91]:
  one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [92]:
data.iloc[:,:3][data.three>5]
Out[92]:
  one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
In [93]:
ser = pd.Series(np.arange(3.))
ser
Out[93]:
0    0.0
1    1.0
2    2.0
dtype: float64
In [94]:
print(ser[:1])
print(ser.loc[:1])
print(ser.iloc[:1])
 
0    0.0
dtype: float64
0    0.0
1    1.0
dtype: float64
0    0.0
dtype: float64
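The three results above differ because label-based slices include the end label, while position-based slices exclude the end position. A short sketch with string labels (illustrative data) makes the difference easier to see:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']      # label slice: the end label 'c' is included
by_position = s.iloc[0:2]      # position slice: the end position 2 is excluded

print(by_label)
print(by_position)
```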
 
  • np.random.randn : generates a matrix of random numbers drawn from a Gaussian (normal) distribution with mean 0 and standard deviation 1
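Because the numbers are random, they change on every run. If reproducible output is wanted, one option (not used in the original post) is NumPy's Generator API with a fixed seed. The seed value 0 below is arbitrary, and the resulting numbers will not match the ones printed in this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # fixed seed -> same numbers on every run
values = rng.standard_normal((4, 3))      # standard normal draws, like np.random.randn(4, 3)

frame = pd.DataFrame(values, columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
```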
In [95]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
Out[95]:
  b d e
Utah -0.729436 0.038952 -0.786652
Ohio 0.994167 0.910993 -3.028374
Texas 1.390470 -0.903524 0.862521
Oregon 1.288409 -1.179690 -2.212963
In [96]:
np.abs(frame)
Out[96]:
  b d e
Utah 0.729436 0.038952 0.786652
Ohio 0.994167 0.910993 3.028374
Texas 1.390470 0.903524 0.862521
Oregon 1.288409 1.179690 2.212963
In [97]:
f = lambda x:x.max()-x.min()

frame.apply(f)
Out[97]:
b    2.119906
d    2.090682
e    3.890894
dtype: float64
In [98]:
frame.apply(f, axis='columns')
Out[98]:
Utah      0.825604
Ohio      4.022541
Texas     2.293994
Oregon    3.501372
dtype: float64
In [99]:
def f(x):
    return pd.Series([x.min(),x.max()], index=['min','max'])
frame.apply(f)
Out[99]:
  b d e
min -0.729436 -1.179690 -3.028374
max 1.390470 0.910993 0.862521
 

Sorting and Ranking

In [100]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
In [101]:
# Sort by index label
obj.sort_index()
Out[101]:
a    1
b    2
c    3
d    0
dtype: int64
In [102]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame
Out[102]:
  d a b c
three 0 1 2 3
one 4 5 6 7
In [103]:
# Sort the rows by index label in ascending order
frame.sort_index()
Out[103]:
  d a b c
one 4 5 6 7
three 0 1 2 3
In [104]:
# Sort the columns by name in ascending order
frame.sort_index(axis=1)
Out[104]:
  a b c d
three 1 2 3 0
one 5 6 7 4
In [105]:
# Sort the columns in descending order
frame.sort_index(axis=1,ascending=False)
Out[105]:
  d c b a
three 0 3 2 1
one 4 7 6 5
In [106]:
obj = pd.Series([4, 7, -3, 2])
In [107]:
# Sort by value, smallest first
obj.sort_values()
Out[107]:
2   -3
3    2
0    4
1    7
dtype: int64
In [108]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
Out[108]:
  b a
0 4 0
1 7 1
2 -3 0
3 2 1
In [109]:
# Sort by column 'b', smallest first
frame.sort_values(by='b')
Out[109]:
  b a
2 -3 0
3 2 1
0 4 0
1 7 1
In [110]:
# Sort by 'a' first, then by 'b' within equal values of 'a'
frame.sort_values(by=['a','b'])
Out[110]:
  b a
2 -3 0
0 4 0
3 2 1
1 7 1
In [111]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj)
 
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
In [181]:
obj.rank()
Out[181]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
In [182]:
obj.rank(method='first')
Out[182]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
In [122]:
# ascending=False -> rank in descending order (largest value gets rank 1)
# method='max' : tied values all receive the highest rank of their group
obj.rank(ascending= False, method='max')
Out[122]:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
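To compare the tie-breaking rules side by side, the sketch below ranks the same Series with several `method` values and collects the results in one table. The column names are chosen here for readability.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = pd.DataFrame({
    'value': obj,
    'average': obj.rank(),                   # default: ties share the mean rank
    'min': obj.rank(method='min'),           # ties share the lowest rank of the group
    'max': obj.rank(method='max'),           # ties share the highest rank of the group
    'first': obj.rank(method='first'),       # ties ranked in order of appearance
    'dense': obj.rank(method='dense'),       # like 'min', but ranks increase by 1 between groups
})
print(ranks)
```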
In [126]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame
Out[126]:
  b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
In [129]:
# Rank the values within each row (across the columns)
frame.rank(axis=1)
Out[129]:
  b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0
In [186]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
Out[186]:
a    0
a    1
b    2
b    3
c    4
dtype: int64
In [187]:
obj.index.is_unique
Out[187]:
False
In [188]:
obj['a']
Out[188]:
a    0
a    1
dtype: int64
In [189]:
obj['c']
Out[189]:
4
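As the two cells above show, the return type depends on whether the label is duplicated. This can surprise code that expects a scalar. A sketch of the behaviour, plus one way to always get a Series back (passing a list to `.loc`):

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

dup = obj['a']       # label appears twice -> a Series with two rows
single = obj['c']    # label appears once  -> a scalar

# Passing a list of labels returns a Series even for a unique label
always_series = obj.loc[['c']]

print(type(dup), type(single), type(always_series))
```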
In [131]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
Out[131]:
  0 1 2
a 0.952780 -0.107355 -2.503290
a 1.798242 0.003292 -0.084674
b 0.242779 -0.224056 -1.409196
b 0.358826 1.316739 0.834168
In [132]:
df.loc['b']
Out[132]:
  0 1 2
b 0.242779 -0.224056 -1.409196
b 0.358826 1.316739 0.834168
In [133]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df
Out[133]:
  one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [134]:
df.sum()
Out[134]:
one    9.25
two   -5.80
dtype: float64
In [135]:
df.sum(axis='columns')
Out[135]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [138]:
# skipna=True (the default) excludes NaN values; with skipna=False, any row containing NaN yields NaN
df.mean(axis='columns',skipna=False)
Out[138]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
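The NaN handling in `sum` can be sketched on the same frame. By default NaN is skipped, so a row made only of NaN sums to 0. The `min_count` parameter is one way to get NaN for such rows instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

default_sum = df.sum(axis='columns')                 # NaN skipped; all-NaN row 'c' sums to 0
strict_sum = df.sum(axis='columns', skipna=False)    # any NaN in the row -> NaN
min_one = df.sum(axis='columns', min_count=1)        # require at least one valid value per row

print(default_sum)
print(strict_sum)
print(min_one)
```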
In [139]:
# idxmax returns the index label of the maximum value (idxmin does the same for the minimum)
df.idxmax()
Out[139]:
one    b
two    d
dtype: object
In [140]:
# Cumulative sum
df.cumsum()
Out[140]:
  one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
In [200]:
df.describe()
Out[200]:
  one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
 

Unique Values, Value Counts, and Membership

In [202]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj
Out[202]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [203]:
uniques = obj.unique()
uniques
Out[203]:
array(['c', 'a', 'd', 'b'], dtype=object)
In [204]:
obj.value_counts()
Out[204]:
c    3
a    3
b    2
d    1
dtype: int64
In [205]:
pd.value_counts(obj.values,sort=False)
Out[205]:
a    3
d    1
b    2
c    3
dtype: int64
In [206]:
obj
Out[206]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [207]:
# Check whether each element is 'b' or 'c'
mask = obj.isin(['b','c'])
mask
Out[207]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
In [208]:
obj[mask]
Out[208]:
0    c
5    b
6    b
7    c
8    c
dtype: object
In [213]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
to_match
Out[213]:
0    c
1    a
2    b
3    b
4    c
5    a
dtype: object
In [214]:
unique_vals = pd.Series(['c', 'b', 'a'])
unique_vals
Out[214]:
0    c
1    b
2    a
dtype: object
In [216]:
pd.Index(unique_vals).get_indexer(to_match)
Out[216]:
array([0, 2, 1, 1, 0, 2], dtype=int64)
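One behaviour the example above does not show: when a value is not present in the index, `get_indexer` returns -1 for it. In the sketch below, 'z' is an illustrative value that is deliberately missing.

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])
to_match = pd.Series(['c', 'a', 'z'])     # 'z' does not occur in unique_vals

positions = unique_vals.get_indexer(to_match)
print(positions)      # -1 marks a value that was not found
```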
In [233]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
In [234]:
data
Out[234]:
  Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
In [235]:
# Count how many times each value appears in each column
result = data.apply(pd.value_counts).fillna(0)
result
Out[235]:
  Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
 

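isin : returns a boolean array indicating whether each element of the Series is contained in the passed sequence of values
match : for each value, computes its integer position in an array of unique values (done above with Index.get_indexer)
unique : returns an array of the distinct values in the Series, with duplicates removed
value_counts : returns a Series whose index holds the unique values and whose values are their frequencies (sorted by frequency in descending order)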
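As a small sketch of `value_counts`, the same letters Series can be counted in absolute terms or, with `normalize=True`, as shares of the total.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = obj.value_counts()                  # absolute frequencies, largest first
shares = obj.value_counts(normalize=True)    # relative frequencies that sum to 1

print(counts)
print(shares)
```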

In [236]:
data['Qu1'].value_counts()[:2]
Out[236]:
4    2
3    2
Name: Qu1, dtype: int64
