원시인

가짜연구소/Data Scientist with Python

Transforming DataFrames

MJ.W 2021. 10. 23. 16:14

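Hello! Today we'll look at transforming data with pandas.

pandas is a Python package for data manipulation.

pandas is built on top of two essential Python packages, NumPy and Matplotlib.

NumPy provides the multidimensional array objects that pandas uses to store data and manipulate it easily, and Matplotlib provides the powerful data visualization capabilities that pandas makes use of.

A pandas DataFrame is tabular: every value within a column has the same data type, but different columns can hold different data types.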
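
To make the "one data type per column" point concrete, here is a minimal sketch. The `toy` DataFrame and its values are made up for illustration and are not the bike data used in the rest of this post.

```python
import pandas as pd

# Hypothetical toy data (not the bike-sharing file used in this post).
toy = pd.DataFrame({
    "city": ["Seoul", "Busan", "Daegu"],  # text values
    "temp": [9.84, 9.02, 13.12],          # floating-point values
    "count": [16, 40, 88],                # integer values
})

# Each column reports a single dtype, and the dtypes differ between columns.
print(toy.dtypes)
```
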

Transforming DataFrames

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

bike = pd.read_csv("Desktop/bike.csv")
bike

	datetime	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
0	2011-01-01 00:00:00	1	0	0	1	9.84	14.395	81	0.0000	3	13	16
1	2011-01-01 01:00:00	1	0	0	1	9.02	13.635	80	0.0000	8	32	40
2	2011-01-01 02:00:00	1	0	0	1	9.02	13.635	80	0.0000	5	27	32
3	2011-01-01 03:00:00	1	0	0	1	9.84	14.395	75	0.0000	3	10	13
4	2011-01-01 04:00:00	1	0	0	1	9.84	14.395	75	0.0000	0	1	1
...	...	...	...	...	...	...	...	...	...	...	...	...
10881	2012-12-19 19:00:00	4	0	1	1	15.58	19.695	50	26.0027	7	329	336
10882	2012-12-19 20:00:00	4	0	1	1	14.76	17.425	57	15.0013	10	231	241
10883	2012-12-19 21:00:00	4	0	1	1	13.94	15.910	61	15.0013	4	164	168
10884	2012-12-19 22:00:00	4	0	1	1	13.94	17.425	61	6.0032	12	117	129
10885	2012-12-19 23:00:00	4	0	1	1	13.12	16.665	66	8.9981	4	84	88

10886 rows × 12 columns

head() returns the first rows of the DataFrame

bike.head(3)  # return the first 3 rows

	datetime	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
0	2011-01-01 00:00:00	1	0	0	1	9.84	14.395	81	0.0	3	13	16
1	2011-01-01 01:00:00	1	0	0	1	9.02	13.635	80	0.0	8	32	40
2	2011-01-01 02:00:00	1	0	0	1	9.02	13.635	80	0.0	5	27	32

tail() returns the last rows of the DataFrame

bike.tail(3)  # return the last 3 rows

	datetime	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
10883	2012-12-19 21:00:00	4	0	1	1	13.94	15.910	61	15.0013	4	164	168
10884	2012-12-19 22:00:00	4	0	1	1	13.94	17.425	61	6.0032	12	117	129
10885	2012-12-19 23:00:00	4	0	1	1	13.12	16.665	66	8.9981	4	84	88
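
If you don't have bike.csv at hand, the same behavior can be tried on a small made-up frame. This is only a sketch; `toy` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical ten-row frame (not the bike data).
toy = pd.DataFrame({
    "hour": range(10),
    "count": [16, 40, 32, 13, 1, 1, 2, 3, 8, 14],
})

first3 = toy.head(3)       # rows with index labels 0, 1, 2
last3 = toy.tail(3)        # rows with index labels 7, 8, 9
default_head = toy.head()  # with no argument, head() returns 5 rows

print(first3)
print(last3)
```
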

info() displays the column names, the data type of each column, and whether any values are missing

bike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB

shape returns a tuple containing the number of rows and columns of the DataFrame (it is an attribute, not a method, so it is written without parentheses)

bike.shape

(10886, 12)
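
The bike data has no missing values, so here is a small sketch with a made-up frame that does have one, to show how info() and shape react. The `toy` frame is hypothetical, and the `isna().sum()` line is an extra check that this post does not otherwise cover.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing temperature.
toy = pd.DataFrame({
    "temp": [9.84, np.nan, 13.12],
    "count": [16, 40, 88],
})

# shape is an attribute: no parentheses.
n_rows, n_cols = toy.shape

# info() prints its report; the Non-Null Count for 'temp' is lower than the row count.
toy.info()

# Extra check (not covered in this post): count missing values per column.
missing = toy.isna().sum()
print(missing)
```
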

describe() computes some summary statistics for the numeric columns, such as the mean and median

bike.describe()

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
count	10886.000000	10886.000000	10886.000000	10886.000000	10886.00000	10886.000000	10886.000000	10886.000000	10886.000000	10886.000000	10886.000000
mean	2.506614	0.028569	0.680875	1.418427	20.23086	23.655084	61.886460	12.799395	36.021955	155.552177	191.574132
std	1.116174	0.166599	0.466159	0.633839	7.79159	8.474601	19.245033	8.164537	49.960477	151.039033	181.144454
min	1.000000	0.000000	0.000000	1.000000	0.82000	0.760000	0.000000	0.000000	0.000000	0.000000	1.000000
25%	2.000000	0.000000	0.000000	1.000000	13.94000	16.665000	47.000000	7.001500	4.000000	36.000000	42.000000
50%	3.000000	0.000000	1.000000	1.000000	20.50000	24.240000	62.000000	12.998000	17.000000	118.000000	145.000000
75%	4.000000	0.000000	1.000000	2.000000	26.24000	31.060000	77.000000	16.997900	49.000000	222.000000	284.000000
max	4.000000	1.000000	1.000000	4.000000	41.00000	45.455000	100.000000	56.996900	367.000000	886.000000	977.000000
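
To see where the mean and median sit in the describe() result, here is a sketch on made-up data. The `toy` frame is hypothetical; note that the median appears in the row labeled 50%.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
toy = pd.DataFrame({
    "temp": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "b", "c", "d"],
})

# By default, describe() summarizes only the numeric columns.
summary = toy.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
mean_temp = summary.loc["mean", "temp"]
median_temp = summary.loc["50%", "temp"]
```
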

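A DataFrame consists of three different components, each accessible through an attribute

values: a two-dimensional NumPy array of the data values, columns: the column names, index: the row labels (note that the row labels are stored in index, not in "rows")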

bike.values

array([['2011-01-01 00:00:00', 1, 0, ..., 3, 13, 16],
       ['2011-01-01 01:00:00', 1, 0, ..., 8, 32, 40],
       ['2011-01-01 02:00:00', 1, 0, ..., 5, 27, 32],
       ...,
       ['2012-12-19 21:00:00', 4, 0, ..., 4, 164, 168],
       ['2012-12-19 22:00:00', 4, 0, ..., 12, 117, 129],
       ['2012-12-19 23:00:00', 4, 0, ..., 4, 84, 88]], dtype=object)

bike.columns

Index(['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count'],
      dtype='object')

bike.index

RangeIndex(start=0, stop=10886, step=1)
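
The three components are easier to tell apart when the row labels are not just 0, 1, 2. A minimal sketch with a made-up frame (the labels "first" and "last" are hypothetical):

```python
import pandas as pd

# Hypothetical frame with custom row labels.
toy = pd.DataFrame(
    {"season": [1, 4], "humidity": [81, 66]},
    index=["first", "last"],
)

print(toy.values)   # 2-D NumPy array holding only the data
print(toy.columns)  # column names
print(toy.index)    # row labels
```
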

Sorting and subsetting

sort_values changes the order of the rows (pass the name of the column to sort by)

bike.sort_values('temp').head(3)  # ascending by default

	datetime	season	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
5499	2012-01-04 06:00:00	1	1	1	0.82	2.275	41	11.0014	0	59	59
5495	2012-01-04 02:00:00	1	1	1	0.82	0.760	34	19.0012	0	1	1
5501	2012-01-04 08:00:00	1	1	1	0.82	3.030	44	8.9981	5	310	315

bike.sort_values('temp', ascending=False).head()  # descending

	datetime	season	weather	temp	atemp	humidity	windspeed	casual	registered	count
8311	2012-07-07 16:00:00	3	1	41.00	43.180	19	11.0014	102	192	294
8309	2012-07-07 14:00:00	3	2	39.36	43.180	30	8.9981	105	203	308
8307	2012-07-07 12:00:00	3	1	39.36	43.180	31	23.9994	124	218	342
8308	2012-07-07 13:00:00	3	2	39.36	43.180	31	16.9979	116	244	360
8312	2012-07-07 17:00:00	3	1	39.36	42.425	26	8.9981	103	176	279
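
Notice in the outputs above that the index labels travel with their rows when sorting. A small sketch on made-up data (the `toy` frame is hypothetical) shows the same thing:

```python
import pandas as pd

# Hypothetical frame (not the bike data).
toy = pd.DataFrame({
    "temp": [13.12, 0.82, 41.0],
    "count": [88, 59, 294],
})

coldest_first = toy.sort_values("temp")                   # ascending is the default
hottest_first = toy.sort_values("temp", ascending=False)  # descending

# sort_values returns a new DataFrame; toy itself keeps its original order.
print(coldest_first)
print(hottest_first)
```
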

To select multiple columns, use two pairs of square brackets [[ ]]

The outer brackets subset the DataFrame, and the inner brackets create the list of column names to keep.

bike[['season', 'humidity']].tail(3)

	season	humidity
10883	4	61
10884	4	61
10885	4	66
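
One way to read the double brackets is to build the inner list separately first. The sketch below uses a made-up frame (`toy` and `cols` are hypothetical names) and also shows that a single pair of brackets gives back a Series rather than a DataFrame.

```python
import pandas as pd

# Hypothetical frame (not the bike data).
toy = pd.DataFrame({
    "season": [4, 4, 4],
    "humidity": [61, 61, 66],
    "temp": [13.94, 13.94, 13.12],
})

cols = ["season", "humidity"]  # the inner brackets: a plain Python list
subset = toy[cols]             # same as toy[["season", "humidity"]]
single = toy["season"]         # one pair of brackets returns a Series

print(type(subset).__name__)
print(type(single).__name__)
```
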

Returning the rows that satisfy multiple conditions: DF[(DF[' ']) & (DF[' '])]

bike[(bike["temp"] > 30) & (bike["temp"] < 40)].head(3)

	datetime	season	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
1409	2011-04-04 14:00:00	2	1	2	30.34	32.575	27	32.9975	47	76	123
1410	2011-04-04 15:00:00	2	1	1	31.16	33.335	23	36.9974	47	96	143
1411	2011-04-04 16:00:00	2	1	1	31.16	32.575	22	35.0008	59	130	189
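
The parentheses around each condition matter because & binds more tightly than > and < in Python. One way to sidestep that, sketched here on made-up data, is to store each condition in its own variable first (`is_warm` and `not_extreme` are hypothetical names).

```python
import pandas as pd

# Hypothetical frame (not the bike data).
toy = pd.DataFrame({"temp": [25.0, 30.34, 31.16, 41.0]})

is_warm = toy["temp"] > 30      # boolean Series, one value per row
not_extreme = toy["temp"] < 40  # boolean Series, one value per row

# & combines the two conditions element by element.
warm = toy[is_warm & not_extreme]
print(warm)
```
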

isin filters on multiple values of a categorical variable (pass it the list of values to keep, then use the result to subset the data)

humidity_c = [23, 22, 21]

bike[bike["humidity"].isin(humidity_c)].head(2)

	datetime	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	count
679	2011-02-11 15:00:00	1	0	1	1	13.12	15.91	21	11.0014	12	62	74
746	2011-02-14 11:00:00	1	0	1	1	21.32	25.00	23	16.9979	10	43	53
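
It can help to look at the True/False mask that isin produces before using it to subset. A minimal sketch on made-up data (`toy`, `wanted`, and `mask` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame (not the bike data).
toy = pd.DataFrame({"humidity": [81, 21, 23, 50, 22]})

wanted = [23, 22, 21]
mask = toy["humidity"].isin(wanted)  # True where the value is in the list
dry = toy[mask]                      # keep only the rows where mask is True

print(mask)
print(dry)
```
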

Adding columns

# DataCamp exercise data

# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_individuals col as proportion of individuals
homelessness['p_individuals']  = homelessness['individuals'] / homelessness['total']

# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness['individuals'] / homelessness['state_pop'] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k" , ascending = False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state" , "indiv_per_10k"]]
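
The homelessness data comes from the DataCamp exercise and is not included in this post, so the sketch below runs the same steps on a tiny stand-in. The three states and all of the numbers are invented for illustration; only the column names follow the exercise code above.

```python
import pandas as pd

# Hypothetical stand-in for the DataCamp homelessness data (values are invented).
homelessness = pd.DataFrame({
    "state": ["A", "B", "C"],
    "individuals": [2500.0, 500.0, 9000.0],
    "family_members": [2500.0, 500.0, 1000.0],
    "state_pop": [1_000_000, 2_000_000, 3_000_000],
})

# New columns are created by assigning to a column name that does not exist yet.
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]

# Filter, sort in descending order, then select two columns.
high = homelessness[homelessness["indiv_per_10k"] > 20]
high_srt = high.sort_values("indiv_per_10k", ascending=False)
result = high_srt[["state", "indiv_per_10k"]]
print(result)
```
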
