Data Handling - Pandas

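These notes are for students taking the IoT Application Software course at Bucheon University. They draw on the course textbook, 파이썬 머신러닝 완벽 가이드 (Python Machine Learning Complete Guide) by Kwon Chul-min.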

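What is Pandas

A data handling library created by Wes McKinney, a former financial analyst.
Pandas makes it easy to convert in-memory data such as lists, collections, and NumPy arrays, as well as CSV files, into a DataFrame, which makes data processing and analysis convenient.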

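The core object in pandas is the DataFrame, a two-dimensional data structure made up of rows and columns. DataFrames are built on top of NumPy's ndarray.

Like a primary key (PK) in an RDB, the index serves only to identify the rows of a DataFrame or Series. It is not part of the data values, so it is left out of computations such as aggregations.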
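As a minimal sketch of the points above, the snippet below builds a DataFrame from a plain dict and from a NumPy ndarray, then looks at the default index. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not part of the Titanic data set.
from_dict = pd.DataFrame({'Name': ['Kim', 'Lee', 'Park'], 'Age': [20, 56, 12]})

# A 2-D ndarray becomes a DataFrame once column names are supplied.
array_2d = np.array([[1, 2], [3, 4], [5, 6]])
from_array = pd.DataFrame(array_2d, columns=['col1', 'col2'])

print(from_dict.index)   # default index: RangeIndex(start=0, stop=3, step=1)
print(from_array.shape)  # (3, 2)
```

When no index is given, pandas assigns 0, 1, 2, ... as row labels.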

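Pandas components

index: a key that identifies each row; it acts as a name tag for the position of a data value
column: a vertical field of the table
row: a horizontal record of the table
Series: a one-dimensional data set made up of the values of a single column. Strictly speaking, it holds only the column's values (with their index), not a column header.
DataFrame: two-dimensional data made up of columns × rows; a collection of several Series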
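The difference between a Series and a DataFrame can be seen by selecting one column in two ways. This is a small sketch on made-up data, not the Titanic set.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({'Age': [20, 56, 12], 'Fare': [10000, 500, 300]})

age_series = df['Age']    # single brackets -> Series (1-D)
age_frame = df[['Age']]   # a list of column names -> DataFrame (2-D)

print(type(age_series).__name__, age_series.shape)  # Series (3,)
print(type(age_frame).__name__, age_frame.shape)    # DataFrame (3, 1)
```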

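Basic API

read_csv(): loads a CSV file into a DataFrame
head(): shows the first rows of the data; mostly used for a quick look (the index appears only as a row label, not as a data column)
shape: the size of the data as (number of rows, number of columns); the index is not counted
info(): shows the columns, data types, and non-null counts
describe(): shows the mean, standard deviation, and quartile distribution of the values
value_counts(): shows the count of each distinct value in a single column (Series)
sort_values(): sorts by the column given in by, ascending (ascending=True) or descending (ascending=False)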

For the hands-on practice, download the Titanic data set from Kaggle (Download All).

(If you do not have a Kaggle account, click Register to sign up, then join the competition and download the data.)

import pandas as pd  # load pandas

train = pd.read_csv('titanic_train.csv', sep=',')
titanic_tr = train  # the later examples refer to the same data as titanic_tr

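header='infer' option (the default): when no column names are passed explicitly, read_csv treats the first line of the file as the header and uses it for the column names.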
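To see the header behavior without a file on disk, the sketch below reads a tiny CSV string through io.StringIO. The CSV content is hypothetical; only the read_csv parameters matter here.

```python
import io
import pandas as pd

# Hypothetical three-line CSV held in memory instead of a file on disk.
csv_text = "PassengerId,Pclass,Age\n1,3,22\n2,1,38\n"

# Default header='infer': the first line becomes the column names.
with_header = pd.read_csv(io.StringIO(csv_text))
# header=None: the first line is kept as an ordinary data row.
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(with_header.columns))                # ['PassengerId', 'Pclass', 'Age']
print(with_header.shape, without_header.shape)  # (2, 3) (3, 3)
```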

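head() shows only the first part of the data; the default is 5, so it returns 5 records. Pass a number in the parentheses to see as many rows as you want. It is usually used together with tail() for a quick look at the data before doing EDA. Note that the index has no column name and, by default, simply serves as the row number.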

train.head()

shape reports the size of the data as (row, column): the first number is the row count and the second is the column count. The index is not included. It is often used together with head() to check the data repeatedly while handling it.

train.shape
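A quick sketch of head(), tail(), and shape on a made-up 10-row frame (standing in for the Titanic data) shows the defaults described above.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for the Titanic data.
df = pd.DataFrame({
    'PassengerId': range(1, 11),
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

print(df.head().shape)                     # (5, 2): head() defaults to 5 rows
print(df.head(3).shape)                    # (3, 2)
print(df.tail(2)['PassengerId'].tolist())  # [9, 10]
print(df.shape)                            # (10, 2): the index is not counted
```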

info() shows the columns, data types, and non-null counts of the data. You can think of the object type as a string.

train.info()  # Age has 714 non-null values out of 891 rows, so it contains nulls
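The null check read off info() can also be done directly. The sketch below uses made-up data with one missing Age; isna().sum() is one common way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age value.
df = pd.DataFrame({'Name': ['Kim', 'Lee', 'Park'], 'Age': [20.0, np.nan, 12.0]})

df.info()                      # prints the non-null count of each column
null_counts = df.isna().sum()  # missing values per column
print(null_counts['Age'])      # 1
```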

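describe() shows the mean, standard deviation, and quartile distribution of the values, so you can check the distribution of every column at once.

titanic_tr.describe()  # PassengerId is just an ID, and Survived and Pclass are categorical, so their statistics are not meaningful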
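As a small illustration on made-up data, describe() by default summarizes only the numeric columns and returns the statistics as rows.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'Age': [20, 56, 12], 'Sex': ['male', 'female', 'female']})

stats = df.describe()           # numeric columns only by default
print(list(stats.columns))      # ['Age']
print(list(stats.index))        # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(stats.loc['50%', 'Age'])  # 20.0 (the median)
```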

value_counts()

titanic_tr['Pclass'].value_counts()
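A minimal sketch of value_counts() on a made-up Pclass column: the result is a Series whose index holds the distinct values and whose values hold the counts, sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical Pclass values for illustration.
pclass = pd.Series([3, 1, 3, 2, 3, 1], name='Pclass')

counts = pclass.value_counts()  # sorted by count, largest first
print(counts.to_dict())         # {3: 3, 1: 2, 2: 1}
```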

sort_values(by="column name")

titanic_tr.sort_values(by='Survived', ascending=False)  # descending

titanic_tr.sort_values(by='Pclass', ascending=True)  # ascending

sort_values(by=['column1', 'column2'], ascending=True/False): select only the columns you want, then sort them with the option values.

titanic_tr[['Embarked','Age','Pclass']].sort_values(by=['Pclass','Age'],ascending=False)

titanic_tr[['Embarked','Age','Pclass']].sort_values(by=['Pclass','Age'],ascending=[False,True])  # Pclass descending, Age ascending
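To make the mixed ordering concrete, here is a sketch on four made-up rows: Pclass is sorted in descending order first, and ties are broken by Age in ascending order.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'Pclass': [1, 3, 3, 1], 'Age': [38, 22, 4, 35]})

# Pclass descending first; ties broken by Age ascending.
sorted_df = df.sort_values(by=['Pclass', 'Age'], ascending=[False, True])
print(sorted_df.index.tolist())  # [2, 1, 3, 0]
print(df.index.tolist())         # [0, 1, 2, 3]: the original is not reordered
```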

How to add a column to a DataFrame

# Add an Age_0 column filled with the value 0

titanic_tr['Age_0']=0

titanic_tr.head(3)

# Adding SibSp (siblings/spouses) and Parch (parents/children) plus 1 for the passenger gives the family size.

titanic_tr['Family_Num'] = titanic_tr['SibSp'] + titanic_tr['Parch']+1

# How to update an existing column

titanic_tr['Family_Num']=titanic_tr['Family_Num']+5

titanic_tr[['Family_Num','SibSp','Parch']].head()
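The same steps can be checked on a few made-up rows. The sketch assumes only the SibSp and Parch column names from the Titanic data; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data using the Titanic column names.
df = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

df['Age_0'] = 0                                   # a scalar is broadcast to every row
df['Family_Num'] = df['SibSp'] + df['Parch'] + 1  # element-wise Series arithmetic
print(df['Family_Num'].tolist())                  # [2, 1, 6]

df['Family_Num'] = df['Family_Num'] + 5           # overwrite an existing column
print(df['Family_Num'].tolist())                  # [7, 6, 11]
```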

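How to delete from a DataFrame

axis (0: row, 1: column) decides what is deleted
inplace (True/False) decides whether the change is applied to the original; compare the data in both cases

drop('label', axis=0/1)

titanic_tr.drop('Age_0', axis=1)  # delete along the column axis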

titanic_tr.head()

titanic_tr = titanic_tr.drop([0,1,2], axis=0)

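titanic_tr.shape  # the row count drops to 888

As titanic_tr.head() showed earlier, calling drop on its own does not delete anything from the original. Assign the result of drop to a variable and check that variable to see that the column is gone.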

titanic_drop_tr = titanic_tr.drop('Age_0', axis=1 )

titanic_drop_tr.head()

To delete several columns at once, pass them as a list.

titanic_drop_tr=titanic_tr.drop(['Age_0', 'Family_Num'], axis=1)

titanic_drop_tr.head()

drop('label', axis=1, inplace=True/False): when inplace is True, the target columns are removed from the original DataFrame itself, which the output below confirms.

titanic_tr.drop(['Pclass', 'Family_Num'], axis=1, inplace=True)

titanic_tr.head()
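The difference between the default behavior and inplace=True can be sketched on a tiny made-up frame. Note that with inplace=True, drop returns None, so its result should not be assigned back.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'Pclass': [1, 2, 3], 'Age_0': [0, 0, 0]})

dropped = df.drop('Age_0', axis=1)      # returns a new DataFrame
print(list(df.columns))                 # ['Pclass', 'Age_0']: original untouched
print(list(dropped.columns))            # ['Pclass']

result = df.drop('Age_0', axis=1, inplace=True)  # modifies df, returns None
print(result, list(df.columns))         # None ['Pclass']

rows_dropped = df.drop([0, 1], axis=0)  # axis=0 drops rows by index label
print(rows_dropped.shape)               # (1, 1)
```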

Data Selection & filtering

loc / iloc: label-based / integer-position-based indexing. Passing a string label to iloc raises an error.
boolean indexing: filtering with a condition (a Series of boolean values)

Example code

titanic_tr=pd.read_csv('train.csv')

# Select only the second-class passengers

titanic_Pclass = titanic_tr[titanic_tr['Pclass'] == 2]

print(type(titanic_Pclass))  # a DataFrame

titanic_Pclass

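Applying the [ ] (bracket) operator

# Two columns (Age, Survived) for the second-class passengers

titanic_tr[titanic_tr['Pclass'] ==2][['Age','Survived']].head()

# loc[condition, ['column1', 'column2']]: the example checks the class and port of embarkation of the survivors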

titanic_tr.loc[titanic_tr['Survived'] == 1, ['Pclass','Embarked']].head(10)
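To contrast loc and iloc, the sketch below uses a made-up frame with a string index, so that labels and positions cannot be confused. The exact exception type raised by iloc for a string key is not asserted here; the except clause simply catches the likely candidates.

```python
import pandas as pd

# Hypothetical toy data with a non-default string index.
df = pd.DataFrame({'Pclass': [1, 3, 2], 'Survived': [1, 0, 1]},
                  index=['a', 'b', 'c'])

print(df.loc['b', 'Pclass'])  # 3: row label 'b', column label 'Pclass'
print(df.iloc[1, 0])          # 3: second row, first column, by position

# loc also accepts a boolean mask for the rows plus a list of column labels.
survivors = df.loc[df['Survived'] == 1, ['Pclass']]
print(survivors.index.tolist())  # ['a', 'c']

try:
    df.iloc['b', 0]  # a label is not a valid integer position
except (ValueError, IndexError, TypeError) as err:
    print('iloc rejected the label:', type(err).__name__)
```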

# Fetch the rows that satisfy several conditions at once (works, but hard to read)

titanic_tr[ (titanic_tr['Survived'] > 0) & (titanic_tr['Pclass']==1) & (titanic_tr['Sex']=='female')]

# Assign each condition to a variable for readability, then fetch the rows (good code)

survive = titanic_tr['Survived'] > 0

pcla = titanic_tr['Pclass']==1

sex_fem = titanic_tr['Sex']=='female'

titanic_tr[ survive & pcla & sex_fem].head()

titanic_tr[ survive & pcla & sex_fem][['Age','Embarked']].head()
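What the combined mask looks like can be seen on four made-up rows. Each condition is a boolean Series, and & combines them element by element; when conditions are written inline, each one needs its own parentheses because & binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical toy data using the Titanic column names.
df = pd.DataFrame({
    'Survived': [1, 0, 1, 1],
    'Pclass': [1, 1, 3, 1],
    'Sex': ['female', 'female', 'female', 'male'],
})

survived = df['Survived'] > 0   # each condition is a boolean Series
first_class = df['Pclass'] == 1
female = df['Sex'] == 'female'

mask = survived & first_class & female  # element-wise AND
print(mask.tolist())                    # [True, False, False, False]
print(df[mask].index.tolist())          # [0]
```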

Aggregation & Groupby

Aggregation: runs aggregate operations such as sum(), max(), min(), mean(), and median() on a DataFrame/Series
groupby(by='column name'): runs a group-by operation

The result of an aggregation depends on whether axis is 0 or 1.

axis=0 runs the operation along the row axis, that is, down each column. Using the example figure, adding up the Age column's 20, 56, and 12 gives 88.
axis=1 runs the operation along the column axis, that is, across each row. In contrast to axis=0, adding the Age value 20 and the Fare value 10000 of one row gives 10020.
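The two directions can be checked with a short sketch. The Age values 20, 56, 12 and the Fare value 10000 come from the example in the text; the other two Fare values are made up.

```python
import pandas as pd

# Age values and the first Fare follow the text; the other Fare values are hypothetical.
df = pd.DataFrame({'Age': [20, 56, 12], 'Fare': [10000, 500, 300]})

col_sums = df.sum(axis=0)  # collapse the rows: one result per column
row_sums = df.sum(axis=1)  # collapse the columns: one result per row

print(col_sums['Age'])     # 88
print(row_sums.iloc[0])    # 10020
```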

# Load the data

import pandas as pd

train= pd.read_csv('train.csv')

# Check the shape of the data

print(train.shape)  # number of rows and columns

train.count()  # number of non-null rows per column

mean (average), median, sum, max, min

titanic_tr[['SibSp', 'Parch']].mean(axis=1)  # mean across the columns, one value per row; rarely used this way

titanic_tr[['SibSp', 'Parch']].mean(axis=0)  # mean down the rows, one value per column

titanic_tr[['SibSp', 'Parch']].median()  # median

titanic_tr[['SibSp', 'Parch']].sum(axis=0)  # sum

titanic_tr[['SibSp', 'Parch']].max()  # maximum

titanic_tr[['SibSp', 'Parch']].min()  # minimum

DataFrame provides the groupby() method for group-by operations. When groupby() receives the column name to group on through the by argument, it returns a DataFrameGroupBy object, and the aggregation function is then called on that returned object.
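The two-step flow described above (groupby first, then an aggregation on the returned object) can be sketched as follows. The rows are made up; only the column names follow the Titanic data.

```python
import pandas as pd

# Hypothetical toy data using the Titanic column names.
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Age': [38, 35, 22, 4, 28],
    'Survived': [1, 1, 0, 1, 0],
})

grouped = df.groupby(by='Pclass')  # a DataFrameGroupBy object; no result yet
print(type(grouped).__name__)      # DataFrameGroupBy

mean_age = grouped['Age'].mean()   # aggregate one column per group
print(mean_age.to_dict())          # {1: 36.5, 3: 18.0}

survived_count = grouped['Survived'].sum()
print(survived_count.to_dict())    # {1: 2, 3: 1}
```

The grouping column becomes the index of the aggregated result.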