원시인

Aggregating DataFrames - Summary statistics



MJ.W 2021. 10. 26. 16:40

Hi there! I studied the summary statistics functions that are used when doing EDA (Exploratory Data Analysis) on data.

They are simple functions, but I think they will come in handy.

import numpy as np
import pandas as pd

sp = pd.read_csv("Desktop/SampleSuperstore.csv")
sp.head(3)

	Ship Mode	Segment	Country	City	State	Postal Code	Region	Category	Sub-Category	Sales	Quantity	Profit
0	Second Class	Consumer	United States	Henderson	Kentucky	42420	South	Furniture	Bookcases	261.96	2	41.9136
1	Second Class	Consumer	United States	Henderson	Kentucky	42420	South	Furniture	Chairs	731.94	3	219.5820
2	Second Class	Corporate	United States	Los Angeles	California	90036	West	Office Supplies	Labels	14.62	2	6.8714

Summary statistics: numbers that summarize a dataset and tell you about it.

I will mostly use the Sales column of the SuperStore dataset to demonstrate the summary statistics.

mean(): mean (average)

sp["Sales"].mean()

229.85790077045777

min(): minimum

sp["Sales"].min()

0.444

max(): maximum

sp["Sales"].max()

22638.48

median(): returns the median

sp["Sales"].median()

54.489999999999995
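
The mean above (about 229.86) is much larger than the median (about 54.49), which suggests that a few very large Sales values pull the mean up. The sketch below shows the same effect on a tiny, made-up Series; the numbers are hypothetical and are not taken from the SuperStore data.

```python
import pandas as pd

# Hypothetical sales figures: four small orders and one very large one.
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 900.0])

mean_value = sales.mean()      # (10 + 20 + 30 + 40 + 900) / 5
median_value = sales.median()  # middle value of the sorted data

print(mean_value)    # 200.0
print(median_value)  # 30.0
```

When the two values differ this much, the median is usually the better description of a "typical" row.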

mode(): returns the mode (the most frequent value)

sp["Sales"].mode()  # returns a Series: 12.96 (at index 0) is the most frequent value in sp["Sales"]

0    12.96
dtype: float64
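
mode() returns a Series rather than a single number because several values can be tied for the highest frequency. Here is a minimal sketch with made-up values (not the SuperStore data) in which two values tie; value_counts() is used only to show the frequencies behind the result.

```python
import pandas as pd

# Hypothetical values: 12.96 and 5.0 both appear twice.
values = pd.Series([12.96, 5.0, 12.96, 5.0, 7.5])

modes = values.mode()           # Series holding every tied mode, sorted ascending
counts = values.value_counts()  # how often each distinct value occurs

print(modes.tolist())      # [5.0, 12.96]
print(counts.loc[12.96])   # 2
```

To get a single value, take the first element with modes.iloc[0].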

var(): variance (pandas computes the sample variance, ddof=1, by default)

sp["Sales"].var()

388434.4842038931

std(): standard deviation

sp["Sales"].std()

623.245123690425
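
The standard deviation is the square root of the variance, and both var() and std() divide by n - 1 (ddof=1) unless told otherwise. The following is a small sketch on made-up numbers, chosen so that the population variance works out to exactly 4; it is not part of the original notebook.

```python
import math
import pandas as pd

# Hypothetical data with mean 5 and squared deviations summing to 32.
data = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = data.var()            # divides by n - 1 (ddof=1, the default)
population_var = data.var(ddof=0)  # divides by n
sample_std = data.std()            # square root of the sample variance

print(population_var)                                    # 4.0
print(math.isclose(sample_var, 32 / 7))                  # True
print(math.isclose(sample_std, math.sqrt(sample_var)))   # True
```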

sum(): sum

sp["Sales"].sum()

2297199.8603000003

quantile()

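quantile() returns the value found at a given position, from 0% to 100% (0 to 1), when the numeric data is sorted by size.
In other words, pass 25% for Q1, 50% for Q2, 75% for Q3, and 100% for Q4 as the parameter, writing each percentage as a decimal fraction.

sp["Sales"].quantile()  # q defaults to 0.5, so this returns the median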

display(sp["Sales"].quantile(0.25), sp["Sales"].quantile(0.5), sp["Sales"].quantile(0.75))

17.28

54.489999999999995

209.94
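
quantile() also accepts a list of fractions, which returns all requested quantiles in one Series indexed by the fraction. A minimal sketch on made-up values follows; the interquartile range (Q3 minus Q1) is added here only as an illustration and does not appear in the original post.

```python
import pandas as pd

# Hypothetical, already sorted values.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

quartiles = values.quantile([0.25, 0.5, 0.75])  # Series indexed by the fraction
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]  # interquartile range

print(quartiles.tolist())    # [2.0, 3.0, 4.0]
print(iqr)                   # 2.0
print(values.quantile(1.0))  # 5.0, the same as values.max()
```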

cumsum(), cummin(), cummax(): cumulative sum, cumulative minimum, cumulative maximum

display(sp["Quantity"].cumsum(), sp["Quantity"].cummin(), sp["Quantity"].cummax())

0           2
1           5
2           7
3          12
4          14
        ...  
9989    37863
9990    37865
9991    37867
9992    37871
9993    37873
Name: Quantity, Length: 9994, dtype: int64

0       2
1       2
2       2
3       2
4       2
       ..
9989    1
9990    1
9991    1
9992    1
9993    1
Name: Quantity, Length: 9994, dtype: int64

0        2
1        3
2        3
3        5
4        5
        ..
9989    14
9990    14
9991    14
9992    14
9993    14
Name: Quantity, Length: 9994, dtype: int64
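
With almost 10,000 rows it is hard to see what the cumulative methods do, so here is a small sketch on five made-up quantities. Each result has the same length as the input, and row i summarizes rows 0 through i.

```python
import pandas as pd

# Hypothetical order quantities.
quantity = pd.Series([2, 3, 2, 5, 1])

running_sum = quantity.cumsum()  # total of all rows so far
running_min = quantity.cummin()  # smallest value seen so far
running_max = quantity.cummax()  # largest value seen so far

print(running_sum.tolist())  # [2, 5, 7, 12, 13]
print(running_min.tolist())  # [2, 2, 2, 2, 1]
print(running_max.tolist())  # [2, 3, 3, 5, 5]
```

The last element of cumsum() equals sum(), and the last elements of cummin() and cummax() equal min() and max().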

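agg() method: lets you compute custom summary statistics. df['column'].agg(function)

sp['Sales'].agg('sum')  # pass the function name as a string; recent pandas versions warn when given the builtin sum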

2297199.8603000003

sp[["Sales", "Quantity"]].agg(['std', 'mean', 'sum'])

	Sales	Quantity
std	6.232451e+02	2.225110
mean	2.298579e+02	3.789574
sum	2.297200e+06	37873.000000
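
The examples above pass only built-in statistics, but agg() can also take a function you write yourself, or a dict that assigns a different statistic to each column. The sketch below uses a small made-up DataFrame in place of the SuperStore file, and the helper name iqr is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the SuperStore data.
df = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0, 40.0],
    "Quantity": [1, 2, 3, 4],
})

def iqr(column):
    """Custom statistic: interquartile range (Q3 minus Q1) of a Series."""
    return column.quantile(0.75) - column.quantile(0.25)

sales_iqr = df["Sales"].agg(iqr)  # custom function applied to one column

# A dict maps each column to the statistic to compute for it.
per_column = df.agg({"Sales": "mean", "Quantity": "sum"})

print(sales_iqr)               # 15.0
print(per_column["Sales"])     # 25.0
print(per_column["Quantity"])  # 10.0
```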

Indio sas!!!
