2020 DATATHON EPA AIR QUALITY

Competition: https://www.kaggle.com/competitions/phase-ii-widsdatathon2022

Excellence in Research Award (Phase II)

WiDS Datathon Further Examines the Impacts of Climate Change

Notebook: https://www.kaggle.com/code/cardata/2020-datathon-epa-air-quality

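Background

Competition description

A datathon in which participants pick one of several datasets (healthcare, energy, environmental protection, and others) and examine climate change from different perspectives.

Participants get the chance to work on a real-world problem and submit a research report.

Dataset

- Datathon_EPA_Air_Quality_Demographics_Meteorology_2020: daily air quality measurements for the United States in 2020

- The EPA (United States Environmental Protection Agency) measures PM2.5 and ozone simultaneously.

- The data include location, meteorological, and demographic information, as well as concentrations of other major air pollutants.

- All data were downloaded from AQS (Air Quality System), except for the four demographic parameters (people of color, low income, linguistically isolated, less than high school education), which come from the EPA.

- The demographic parameters are at the Census Bureau's "block group" level (an area that generally contains between 600 and 3,000 people) and are listed in fractional units for the block group that contains the monitor location.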

import pandas as pd

df = pd.read_excel('/kaggle/input/phase-ii-widsdatathon2022/epa/epa/Datathon_EPA_Air_Quality_Demographics_Meteorology_2020.xlsx')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133418 entries, 0 to 133417
Data columns (total 22 columns):
 #   Column                            Non-Null Count   Dtype         
---  ------                            --------------   -----         
 0   AQS_ID                            133418 non-null  object        
 1   LATITUDE                          133418 non-null  float64       
 2   LONGITUDE                         133418 non-null  float64       
 3   COUNTY                            133418 non-null  object        
 4   STATE                             133418 non-null  object        
 5   CBSA                              121424 non-null  object        
 6   PEOPLE_OF_COLOR_FRACTION          133404 non-null  float64       
 7   LOW_INCOME_FRACTION               133404 non-null  float64       
 8   LINGUISTICALLY_ISOLATED_FRACTION  133404 non-null  float64       
 9   LESS_THAN_HS_ED_FRACTION          133404 non-null  float64       
 10  DATE                              133418 non-null  datetime64[ns]
 11  TEMPERATURE_CELSIUS               71845 non-null   float64       
 12  RELATIVE_HUMIDITY                 49413 non-null   float64       
 13  WIND_SPEED_METERS_PER_SECOND      59659 non-null   float64       
 14  WIND_DIRECTION                    60329 non-null   float64       
 15  PM25_UG_PER_CUBIC_METER           133418 non-null  float64       
 16  OZONE_PPM                         133418 non-null  float64       
 17  NO2_PPB                           64034 non-null   float64       
 18  CO_PPM                            40965 non-null   float64       
 19  SO2_PPB                           47950 non-null   float64       
 20  LEAD_UG_PER_CUBIC_METER           421 non-null     float64       
 21  BENZENE_PPBC                      2806 non-null    float64       
dtypes: datetime64[ns](1), float64(17), object(4)
memory usage: 22.4+ MB

Location-related

AQS_ID : Air Quality System ID

LATITUDE / LONGITUDE / COUNTY / STATE / CBSA : location information

* CBSA (Core-Based Statistical Area) : a geographic unit used in the United States for analyzing demographic and economic data

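Demographic parameters for the location

PEOPLE_OF_COLOR_FRACTION : fraction of people of color in the area

LOW_INCOME_FRACTION : fraction of low-income residents in the area

LINGUISTICALLY_ISOLATED_FRACTION : fraction of linguistically isolated households in the area

LESS_THAN_HS_ED_FRACTION : fraction of the area's population with less than a high school education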

Weather-related

DATE : date

TEMPERATURE_CELSIUS : air temperature (°C)

RELATIVE_HUMIDITY : relative humidity

WIND_SPEED_METERS_PER_SECOND : wind speed (m/s)

WIND_DIRECTION : wind direction

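Pollution-related

PM25_UG_PER_CUBIC_METER : concentration of PM2.5 in the air (fine particles with a diameter of 2.5 micrometers or less) (μg/m³)

OZONE_PPM : concentration of ozone in the air (ppm : parts per million)

NO2_PPB : concentration of nitrogen dioxide (NO2) in the air (ppb : parts per billion)

CO_PPM : concentration of carbon monoxide (CO) in the air (ppm)

SO2_PPB : concentration of sulfur dioxide (SO2) in the air (ppb)

LEAD_UG_PER_CUBIC_METER : concentration of lead (Pb) in the air (μg/m³)

BENZENE_PPBC : concentration of benzene in the air (ppbC : parts per billion carbon)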

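Checking the distribution

With the default include option, describe covers only the numeric columns,

whereas include='all' covers columns of every dtype.

As a result, the summary contains NaN wherever a statistic does not apply to a column's dtype (for example, mean for an object column, or unique / top / freq for a numeric column).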
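The difference between the two describe calls is easy to see on a small frame. The sketch below uses a hypothetical two-column toy DataFrame (the values are made up, only the column names echo the dataset) rather than the EPA data itself.

```python
import pandas as pd

# Hypothetical toy data, not the EPA dataset
toy = pd.DataFrame({
    "STATE": ["CA", "CA", "TX"],
    "OZONE_PPM": [0.03, 0.04, 0.05],
})

default_stats = toy.describe()            # numeric columns only
all_stats = toy.describe(include="all")   # every column, whatever its dtype

print(default_stats.columns.tolist())     # STATE is absent here
print(all_stats)

# Statistics that do not apply to a column's dtype are reported as NaN
print(pd.isna(all_stats.loc["mean", "STATE"]))
print(pd.isna(all_stats.loc["top", "OZONE_PPM"]))
```

The NaN cells in the include='all' output are therefore placeholders for "not applicable", not missing measurements.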

Handling missing values

all_count = df.shape[1]
categorical_count = df.select_dtypes(include='object').shape[1]
# exclude='object' also counts the datetime64 DATE column, so this is 17 float columns plus DATE
numerical_count = df.select_dtypes(exclude='object').shape[1]

print(f'# of all vars: {all_count}')
print(f'# of categorical vars: {categorical_count}')
print(f'# of numerical vars: {numerical_count}')

# of all vars: 22
# of categorical vars: 4
# of numerical vars: 18

Location

print(df.select_dtypes(include=['object']).nunique())

AQS_ID    513
COUNTY    347
STATE      52
CBSA      254
dtype: int64

Let's check the percentage of null values in each column.

missing_values_percentage = df.isnull().mean() * 100
missing_values_percentage_sorted = missing_values_percentage.sort_values()

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.barplot(x=missing_values_percentage_sorted, y=missing_values_percentage_sorted.index)
plt.title('Percentage of Missing Values in Each Column (Ascending)')
plt.xlabel('Percentage of Missing Values')
plt.ylabel('Columns')
plt.show()

sns.heatmap(df.isnull(),cbar=False)
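The `df.isnull().mean() * 100` idiom above works because `isnull()` returns booleans, and the mean of a boolean column is the fraction of True values. A minimal sketch on a hypothetical toy frame (made-up values) shows that it matches the explicit count-divided-by-rows calculation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: CO_PPM is missing in 3 of 4 rows
toy = pd.DataFrame({
    "PM25": [8.0, 9.5, 7.2, 10.1],
    "CO_PPM": [0.3, np.nan, np.nan, np.nan],
})

# Mean of the boolean mask = share of missing rows per column
missing_pct = toy.isnull().mean() * 100

# The same number computed explicitly: missing count / total rows
manual_pct = toy.isnull().sum() / len(toy) * 100

print(missing_pct)
print(missing_pct.equals(manual_pct))
```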

fillna

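# Non-numeric columns: fill with the most frequent value (mode); numeric columns: fill with the mean
for col in df.columns:
    if df[col].dtype == 'object':
        # Assign the result back; fillna(inplace=True) on a column selection may not update df in recent pandas
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())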
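To see what this rule produces, here is a minimal sketch on a hypothetical toy frame (made-up values). As an assumption for robustness, it checks `pd.api.types.is_numeric_dtype` instead of comparing the dtype to 'object', and it assigns the filled column back rather than filling in place.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    "CBSA": ["A", None, "A", "B"],
    "NO2_PPB": [10.0, np.nan, 20.0, 30.0],
})

for col in toy.columns:
    if is_numeric_dtype(toy[col]):
        # Numeric column: fill with the mean of the observed values
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Non-numeric column: fill with the most frequent value
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
print(toy.isnull().sum())
```

The gap in CBSA takes the mode ("A") and the gap in NO2_PPB takes the mean of 10, 20, and 30.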

# Check for remaining nulls (checked twice, just to be sure)
missing_values = df.isnull().sum()
missing_values_after = df.isnull().sum()

(missing_values, missing_values_after)

missing_df =  df.isnull().sum().to_frame().rename(columns={0:"Total No. of Missing Values"})
missing_df["% of Missing Values"] = round((missing_df["Total No. of Missing Values"]/len( df))*100,2)
missing_df

Re-running the earlier code now shows 0 for every column, so the missing values are resolved.

Removing duplicate rows

df[df.duplicated(keep=False)]
df.drop_duplicates(inplace=True)

- The data in this notebook has no duplicate rows, so the DataFrame comes out unchanged.
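Since the real data has no duplicates, the effect of the two calls is easier to see on a hypothetical toy frame (made-up values) that does contain one repeated row:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical
toy = pd.DataFrame({
    "AQS_ID": ["a", "a", "b"],
    "PM25": [8.0, 8.0, 9.0],
})

# keep=False marks every copy of a duplicated row, which is handy for inspection
dupes = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default
deduped = toy.drop_duplicates()

print(dupes)
print(deduped)
```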

Univariate Analysis

cat_cols = df.select_dtypes(include='object').columns.tolist()

# Build a DataFrame holding the count of each value in every categorical column
cat_df = (df[cat_cols].melt(var_name='column', value_name='value')
          .value_counts().rename('count').to_frame()
          .sort_values(by=['column', 'count']))

display(df[cat_cols].describe())
display(cat_df)

describe for the columns whose dtype is object

df.describe(include='O').T

Sorting by the number of unique values

df.nunique().sort_values()

EDA

skewness : a measure of how asymmetric (lopsided) a distribution is
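As a quick illustration of how to read the sign, the sketch below calls pandas' `skew()` on two hypothetical series (made-up values): a symmetric one, and one with a single large value that stretches the right tail.

```python
import pandas as pd

# Hypothetical toy series, not columns from the EPA dataset
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = pd.Series([1.0, 1.0, 2.0, 2.0, 50.0])

print(symmetric.skew())        # about 0: the two sides mirror each other
print(right_tailed.skew())     # positive: long tail toward large values
print((-right_tailed).skew())  # negative: mirroring the data flips the sign
```

A positive value means the long tail is on the right, a negative value means it is on the left, and a value near 0 means the distribution is roughly symmetric.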

skewness = df.select_dtypes(include=['int64', 'float64']).skew()

num_cols_count = len(df.select_dtypes(include=['int64', 'float64']).columns)

num_rows = (num_cols_count + 3) // 4
num_cols = min(4, num_cols_count)

# Plot histograms of the numerical columns to visualize their distributions and spot anomalies
# squeeze=False keeps axes two-dimensional even if the grid has a single row or column
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10), squeeze=False)

for i in range(num_rows):
    for j in range(num_cols):
        col_idx = i * num_cols + j
        if col_idx < num_cols_count:
            col = df.select_dtypes(include=['int64', 'float64']).columns[col_idx]
            axes[i, j].hist(df[col], bins=15, color='green', alpha=0.7)
            axes[i, j].set_title(f'{col}')
            axes[i, j].set_xlabel(col)
            axes[i, j].set_ylabel('Frequency')
            
            skew_val = skewness[col]
            
            axes[i, j].text(0.5, 0.5, f'Skewness: {skew_val:.2f}', horizontalalignment='center',
                            verticalalignment='center', transform=axes[i, j].transAxes, fontsize=10, color='red')
        else:
            # Hide grid cells that have no column to plot
            axes[i, j].axis('off')

plt.tight_layout()
plt.show()

print("Skewness:")
print(skewness)

df.plot(kind='box', rot=45,color='green')

plt.show()

numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Draw a boxplot for each numerical feature to check for outliers
for column in numeric_cols:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=df[column],palette='rainbow')
    plt.title(f'Boxplot of {column}')
    plt.show()

Multivariate Analysis

# Correlation matrix

# Select the numeric columns of the DataFrame
numeric_columns = df.select_dtypes(include=['number'])

# Compute the correlation matrix
correlation_matrix = numeric_columns.corr()

# Draw a heatmap to visualize the correlations
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='rainbow', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

# Heatmap plotting
# Select the numeric columns
numeric_columns = df.select_dtypes(include=['int64', 'float64'])

# Compute the correlation matrix
corr_matrix = numeric_columns.corr()

# Filter the correlation matrix: keep values > 0.5 or < -0.5; all other cells become NaN
corr_matrix_filtered = corr_matrix[(corr_matrix > 0.5) | (corr_matrix < -0.5)]

# Plot a heatmap of the filtered correlation values
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix_filtered, annot=True, cmap='rainbow', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Features (|Correlation| > 0.5)')
plt.show()
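The filtering step above relies on boolean indexing of a DataFrame with a same-shaped mask: cells where the mask is False are replaced by NaN, which the heatmap then leaves blank. A minimal sketch on a hypothetical toy frame (columns a, b, c are made up), with `where` on the absolute value shown as an equivalent spelling:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],    # perfectly correlated with a
    "c": [1.0, -1.0, 0.0, -1.0, 1.0],   # uncorrelated with a
})

corr = toy.corr()

# Same mask as in the notebook: keep strong correlations, NaN elsewhere
strong = corr[(corr > 0.5) | (corr < -0.5)]

# Equivalent spelling using the absolute value
same = corr.where(corr.abs() > 0.5)

print(strong)
print(strong.equals(same))
```

Note that the diagonal (each column against itself) always passes the filter, because a column's correlation with itself is 1.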

# Explore the categorical features
for column in df.select_dtypes(include=['object']):
    sns.countplot(x=column, data=df,palette='rainbow')
    plt.tight_layout()
    plt.show()

# Explore the categorical features, annotating each bar
for column in df.select_dtypes(include=['object']):
    plt.figure(figsize=(10, 6))
    ax = sns.countplot(x=column, data=df,palette='rainbow')
    
    # Annotate each bar with its count and percentage
    total = len(df[column])
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)
        count = int(p.get_height())  # bar heights are floats; cast so the label reads 123, not 123.0
        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        ax.annotate(f'{count}\n{percentage}', (x, y), ha='center', va='bottom')
    
    plt.title(f'Count Plot for {column}', fontsize=15)
    plt.xlabel(column, fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.xticks(rotation=45) 
    plt.tight_layout()
    plt.show()

References

https://github.com/PacktPublishing/Hands-on-Exploratory-Data-Analysis-with-Python