Python Study ver3. Week 17

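Hello. This week's post is, once again, just a progress update on my project.

Picking up from last week, the thing I wanted to look at, the goal of all this processing, was:

"So how many kinds of data can I actually get in total (after excluding duplicate and invalid values)?"

What counts as valid or invalid involves a fair amount of my own subjective judgment, but without that filtering the number of columns gets enormous, and there is that much more to process.. T-T (Values like 'effective barrier & heal use', 'total heal', 'total team heal', and 'units healed' are bound to overlap, after all...)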

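After first inspecting the values (by walking the tree, among other things) and then pulling out the various elements of a single game,

this is what I ended up with: 82 columns in total, though since I had been treating two CS-related metrics as one, it is effectively 83. I figured it is unlikely that all 83 features affect 'win', so I decided to use a tree-based classifier to pick out just the 30 features with the biggest impact on 'win'.

While exploring the features themselves I had pulled out only 1 game from the total of 10,000 games, so for feature selection I used a slightly larger range: 100 games (= 1 file). Walking through that process: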

import json
with open('/content/drive/MyDrive/lol_matches/match1.json', 'r') as file:
    match1 = json.load(file) # the file built earlier (game info fetched via the API) is JSON

import pandas as pd
match1_df = pd.DataFrame() # the info buried in the nested JSON will be laid out as df columns

for x in range(100): # create the matchId column first
    for y in range(10):
        match1_df.loc[(x*10)+y, 'matchId'] = match1['games'][x]['metadata']['matchId']

match1_df # len(match1_df) is 1000 (= 100 games x 10 players per game)

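After loading the file (match1), I add columns to the empty df one at a time. The directory(?) structure, i.e. the nesting, was so deep that I also wrote a couple of helper functions.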

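# participants holds far too many fields, so just wrap the lookup in a function
# (both helpers read the loop variables i and j of the calling loop)
def participant_column(df, data, columns):
    for col in columns:
        try:
            df.loc[(i*10)+j, col] = data['games'][i]['info']['participants'][j][col]
        except (KeyError, IndexError): # field missing for this player
            df.loc[(i*10)+j, col] = 0

# same idea for the fields inside challenges
def challenge_column(df, data, columns):
    for col in columns:
        try:
            df.loc[(i*10)+j, col] = data['games'][i]['info']['participants'][j]['challenges'][col]
        except (KeyError, IndexError): # field missing for this player
            df.loc[(i*10)+j, col] = 0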

for i in range(len(match1['games'])): # go through the games one at a time
    info = match1['games'][i]['info']
    if 'participants' not in info: # only the participants entry is needed here
        continue

    for j in range(10):
        row = (i*10)+j
        player = info['participants'][j]

        # start with the game duration (time processing comes later)
        match1_df.loc[row, 'gameDuration'] = info['gameDuration']

        # build the position info
        if player['teamPosition'] == player['individualPosition']:
            match1_df.loc[row, 'position'] = player['teamPosition']
        else:
            match1_df.loc[row, 'position'] = player['individualPosition'] + " " + player['teamPosition']

        participant_column(match1_df, match1, ['win', 'teamId', 'kills', 'deaths', 'assists'])
        participant_column(match1_df, match1, ['firstBloodAssist', 'firstBloodKill', 'firstTowerAssist', 'firstTowerKill'])
        participant_column(match1_df, match1, ['turretKills', 'turretTakedowns', 'turretsLost', 'inhibitorKills', 'inhibitorTakedowns', 'inhibitorsLost', 'nexusKills'])
        participant_column(match1_df, match1, ['killingSprees', 'gameEndedInEarlySurrender', 'neutralMinionsKilled', 'totalMinionsKilled', 'objectivesStolen', 'objectivesStolenAssists', 'totalDamageDealtToChampions'])
        participant_column(match1_df, match1, ['visionScore', 'visionWardsBoughtInGame', 'wardsPlaced', 'wardsKilled'])

        # now the challenges fields (reusing the helper from my function notes, slightly modified)
        challenge_column(match1_df, match1, ['bountyGold', 'goldPerMinute', 'kda'])
        challenge_column(match1_df, match1, ['scuttleCrabKills', 'earliestDragonTakedown', 'dragonTakedowns', 'teamRiftHeraldKills', 'baronTakedowns', 'teamBaronKills', 'buffsStolen', 'epicMonsterKillsNearEnemyJungler', 'epicMonsterKillsWithin30SecondsOfSpawn', 'epicMonsterSteals', 'epicMonsterStolenWithoutSmite'])
        challenge_column(match1_df, match1, ['firstTurretKilledTime', 'quickFirstTurret', 'kTurretsDestroyedBeforePlatesFall', 'turretPlatesTaken', 'turretsTakenWithRiftHerald', 'multiTurretRiftHeraldCount', 'hadOpenNexus', 'lostAnInhibitor'])
        challenge_column(match1_df, match1, ['controlWardTimeCoverageInRiverOrEnemyHalf', 'controlWardsPlaced', 'visionScoreAdvantageLaneOpponent', 'visionScorePerMinute', 'wardTakedowns', 'wardTakedownsBefore20M', 'wardsGuarded', 'killAfterHiddenWithAlly'])
        challenge_column(match1_df, match1, ['damagePerMinute', 'effectiveHealAndShielding', 'dodgeSkillShotsSmallWindow', 'enemyChampionImmobilizations', 'immobilizeAndKillWithAlly', 'killParticipation', 'killsNearEnemyTurret', 'killsUnderOwnTurret', 'legendaryCount', 'multikills', 'outnumberedKills', 'soloKills', 'teamDamagePercentage', 'maxKillDeficit'])
        challenge_column(match1_df, match1, ['maxLevelLeadLaneOpponent', 'earlyLaningPhaseGoldExpAdvantage', 'laningPhaseGoldExpAdvantage', 'takedownsAfterGainingLevelAdvantage', 'junglerTakedownsNearDamagedEpicMonster', 'completeSupportQuestInTime'])

match1_df
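
The cell-by-cell .loc assignments above could also be written by collecting one plain dict per player and building the frame in a single step at the end. This is only a minimal sketch, assuming the same games -> info -> participants -> challenges nesting as match1; flatten_games and the toy sample data are hypothetical names made up for illustration, not part of the original notebook.

```python
def flatten_games(data, participant_cols, challenge_cols):
    """Return one flat dict per player; missing keys default to 0."""
    rows = []
    for game in data['games']:
        info = game['info']
        for player in info['participants']:
            row = {
                'matchId': game['metadata']['matchId'],
                'gameDuration': info['gameDuration'],
            }
            for col in participant_cols:
                row[col] = player.get(col, 0)
            # a player without a 'challenges' entry gets 0 for every challenge column
            challenges = player.get('challenges', {})
            for col in challenge_cols:
                row[col] = challenges.get(col, 0)
            rows.append(row)
    return rows

# toy data with the same nesting (all values are made up)
sample = {'games': [{
    'metadata': {'matchId': 'KR_1'},
    'info': {'gameDuration': 1800, 'participants': [
        {'win': True, 'kills': 3, 'challenges': {'kda': 2.5}},
        {'win': False, 'kills': 1},  # no 'challenges' key at all
    ]},
}]}

rows = flatten_games(sample, ['win', 'kills'], ['kda'])
print(rows[1])
```

Passing the resulting list to pd.DataFrame(rows) would give one row per player, without growing the frame one cell at a time.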

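The structure looks rather complicated, doesn't it..? But believe it or not, this is the "tidied-up" version... haha. Anyway, through this process I built a DataFrame called match1_df with 82 columns in total. I also used describe to find the columns whose values are all 0 (= no valid values) and dropped them.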
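
The all-zero check can also be done programmatically instead of reading the describe output by eye. A minimal sketch on a toy frame standing in for match1_df; the column values here are made up, and which columns were actually all zero in the real data is not stated in this post.

```python
import pandas as pd

# toy frame standing in for match1_df (values are made up)
df = pd.DataFrame({
    'kills': [3, 0, 5],
    'nexusKills': [0, 0, 0],  # every value is 0
    'position': ['TOP', 'JUNGLE', 'MIDDLE'],
})

# True for each column that contains at least one non-zero value
has_signal = (df != 0).any(axis=0)
zero_cols = has_signal.index[~has_signal].tolist()

cleaned = df.drop(columns=zero_cols)
print(zero_cols)
print(cleaned.columns.tolist())
```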

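display(match1_df.groupby('teamId')['win'].mean()) # red team win rate 0.39, blue team win rate 0.61
print()
display(match1_df.groupby('teamId')['firstBloodKill'].mean()) # red team rate 0.088, blue team rate 0.112

Before running any model, I took a quick look at the data. The red team's win rate is 0.39 and the blue team's is 0.61. In line with that, the average rate of firstBloodKill (first kill) is 0.088 for the red team and 0.112 for the blue team. Even setting aside whether that win-rate gap is really meaningful, the difference is bigger than I expected..? I found that fascinating.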
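
One way to get a rough feel for whether a 0.61 vs 0.39 split could be chance is an exact two-sided binomial test against a 50/50 null. This is only a sketch I did not run on the real data: it assumes the 0.61 figure corresponds to 61 blue-side wins in 100 independent games, and binom_two_sided_p is a hypothetical helper written for this illustration.

```python
from math import comb

def binom_two_sided_p(wins, games, p=0.5):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed one."""
    pmf = [comb(games, k) * p**k * (1 - p)**(games - k) for k in range(games + 1)]
    observed = pmf[wins]
    # small tolerance so ties with the observed probability are included
    return sum(x for x in pmf if x <= observed * (1 + 1e-9))

# assumption: 0.61 blue-side win rate == 61 wins out of 100 games
p_value = binom_two_sided_p(61, 100)
print(round(p_value, 4))
```

With only 100 games from a single file, any such number should be read loosely; more files would give a steadier estimate.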

cat_cols = ['position', 'firstBloodAssist', 'firstBloodKill', 'firstTowerAssist', 'firstTowerKill', 'gameEndedInEarlySurrender']
match_data = match1_df.drop(['win', 'matchId'], axis=1) # matchId is not needed for generalization
match_data[cat_cols] = match_data[cat_cols].astype('category')
# hit some category-related error, so changed it to the above!
match_target = match1_df['win'].astype(int) # converts the False/True values to 0 and 1

from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier # LightGBM chosen to get feature importances with minimal preprocessing
X_train, X_test, y_train, y_test = train_test_split(match_data, match_target, test_size=0.2, random_state=231)
model = LGBMClassifier(random_state=42)
model.fit(X_train, y_train)

from lightgbm import plot_importance
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 12))
plot_importance(model, ax=ax)

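Since I picked features purely from the model output, without any one-hot encoding or standardization/normalization, my plan is to keep the result above as it is for now and check again whether a similar result comes out when I apply one-hot encoding and select features with a different model.

For now, this round's takeaways come down to the following two.

Identifying 24 features whose influence can be called meaningful (down to 'earliestDragonTakedown' with a value of 30, which is just under 1/10 of 312, the importance of the top feature turretsLost)
Thinking about how to handle values that represent 'time', such as 'gameDuration', 'firstTurretKilledTime', and 'earliestDragonTakedown': I think this can be sorted out later

From this state, I applied get_dummies (dummy, i.e. one-hot, encoding) to position only, and then used RandomForestClassifier together with RFE.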
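
The "keep everything down to an importance of 30" cut used for the 24 features can be sketched as a simple threshold over name/importance pairs. In the dict below only the turretsLost (312) and earliestDragonTakedown (30) values come from the plot described in this post; the other entries and the select_by_cutoff helper are hypothetical and exist only to make the example runnable.

```python
# importances keyed by feature name
importances = {
    'turretsLost': 312,            # from the post
    'goldPerMinute': 150,          # hypothetical value
    'earliestDragonTakedown': 30,  # from the post
    'firstBloodKill': 4,           # hypothetical value
}

def select_by_cutoff(scores, cutoff):
    """Keep features whose importance is at least `cutoff`, highest first."""
    kept = [(name, value) for name, value in scores.items() if value >= cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept]

selected = select_by_cutoff(importances, cutoff=30)
print(selected)
```

With a fitted model, a dict like this could presumably be built with dict(zip(X_train.columns, model.feature_importances_)).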

X_train = pd.get_dummies(data=X_train, columns=['position'], prefix='position')
X_test = pd.get_dummies(data=X_test, columns=['position'], prefix='position')
# apply get_dummies to X_test as well, to be ready for the future(?)

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
select = RFE(RandomForestClassifier(random_state=0), n_features_to_select=30)
select.fit(X_train, y_train)

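print(select.support_) # boolean mask: which columns were selected and which were not
print(X_train.columns[select.support_]) # names of the selected columns

Left: select.support_; right: X_train.columns[select.support_]

So at this point I got curious. How much do the columns picked by blindly running LightGBM overlap with the columns picked by backward selection? For a more accurate comparison, I checked against all 57 columns shown in plot_importance (rather than the 24 features I summarized).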
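
A note on calling get_dummies separately on X_train and X_test: in this post position was cast to category before the split, so both splits should carry the same categories and end up with the same dummy columns. With a plain string column that is not guaranteed, and the test frame may need to be aligned to the training columns. A minimal sketch of that alignment on made-up toy frames:

```python
import pandas as pd

# toy splits; 'UTILITY' appears only in the test split (values are made up)
train = pd.DataFrame({'kills': [3, 1], 'position': ['TOP', 'JUNGLE']})
test = pd.DataFrame({'kills': [2, 0], 'position': ['TOP', 'UTILITY']})

train_enc = pd.get_dummies(train, columns=['position'], prefix='position')
test_enc = pd.get_dummies(test, columns=['position'], prefix='position')

# give the test frame exactly the training columns: dummies missing from
# the test split are filled with 0, dummies unseen in training are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_aligned.columns.tolist())
```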

randomfr_columns = ['gameDuration', 'kills', 'deaths', 'assists', 'turretKills', 'turretTakedowns', 'turretsLost', 'inhibitorTakedowns', 'inhibitorsLost', 'totalMinionsKilled', 'totalDamageDealtToChampions', 'bountyGold', 'goldPerMinute', 'kda', 'earliestDragonTakedown',
	'teamRiftHeraldKills', 'baronTakedowns', 'teamBaronKills', 'firstTurretKilledTime', 'turretPlatesTaken', 'turretsTakenWithRiftHerald', 'lostAnInhibitor', 'controlWardTimeCoverageInRiverOrEnemyHalf', 'visionScoreAdvantageLaneOpponent', 'damagePerMinute', 'dodgeSkillShotsSmallWindow', 'immobilizeAndKillWithAlly', 'killParticipation', 'maxKillDeficit', 'maxLevelLeadLaneOpponent']
l_columns = ['gameDuration', 'teamId', 'kills', 'deaths', 'assists', 'firstBloodAssist', 'firstBloodKill', 'firstTowerKill', 'turretKills', 'turretTakedowns', 'turretsLost', 'inhibitorTakedowns', 'inhibitorsLost', 'killingSprees', 'neutralMinionsKilled', 'totalMinionsKilled', 'totalDamageDealtToChampions', 'visionScore', 'visionWardsBoughtInGame', 'wardsPlaced', 'wardsKilled',
	'bountyGold', 'goldPerMinute', 'kda', 'scuttleCrabKills', 'earliestDragonTakedown', 'teamRiftHeraldKills', 'baronTakedowns', 'teamBaronKills', 'epicMonsterKillsWithin30SecondsOfSpawn', 'firstTurretKilledTime', 'kTurretsDestroyedBeforePlatesFall', 'turretPlatesTaken', 'turretsTakenWithRiftHerald', 'lostAnInhibitor', 'controlWardTimeCoverageInRiverOrEnemyHalf', 'controlWardsPlaced',
    'visionScoreAdvantageLaneOpponent', 'visionScorePerMinute', 'wardTakedowns', 'wardTakedownsBefore20M', 'killAfterHiddenWithAlly', 'damagePerMinute', 'effectiveHealAndShielding', 'dodgeSkillShotsSmallWindow', 'enemyChampionImmobilizations', 'immobilizeAndKillWithAlly', 'killParticipation', 'killsNearEnemyTurret', 'killsUnderOwnTurret', 'multikills', 'outnumberedKills', 'soloKills',
    'teamDamagePercentage', 'maxKillDeficit', 'earlyLaningPhaseGoldExpAdvantage', 'laningPhaseGoldExpAdvantage']

tp_lst = []
for r_cols in randomfr_columns:
    if r_cols in l_columns:
        tp_lst.append(r_cols)

print(tp_lst)
print(len(tp_lst), len(l_columns)) #29, 57

... so 29 of the 30 were in LightGBM's feature_importance plot.

What if I had set RFE's n_features_to_select to around 60 instead of 30? Would all 57 have shown up? I was curious, so I ran it again. haha
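
The same comparison can be written with sets, which also shows directly which names fall outside the overlap (in the lists above, the one RFE column missing from l_columns is maxLevelLeadLaneOpponent). A small sketch using shortened versions of the two lists; rfe_cols and lgbm_cols are stand-in names for this illustration.

```python
rfe_cols = ['kills', 'deaths', 'kda', 'maxLevelLeadLaneOpponent']  # shortened list
lgbm_cols = ['kills', 'deaths', 'kda', 'teamId', 'visionScore']    # shortened list

lgbm_set = set(lgbm_cols)
shared = [c for c in rfe_cols if c in lgbm_set]        # keeps the rfe_cols order
only_rfe = [c for c in rfe_cols if c not in lgbm_set]  # picked by RFE only
only_lgbm = sorted(lgbm_set - set(rfe_cols))           # picked by LightGBM only

print(len(shared), only_rfe, only_lgbm)
```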

select = RFE(RandomForestClassifier(random_state=132), n_features_to_select=60)
# also tried a different random_state
select.fit(X_train, y_train)

randomfr_cols_ver2 = X_train.columns[select.support_] # 60 columns
l_columns = ['gameDuration', 'teamId', 'kills', 'deaths', 'assists', 'firstBloodAssist', 'firstBloodKill', 'firstTowerKill', 'turretKills', 'turretTakedowns', 'turretsLost', 'inhibitorTakedowns', 'inhibitorsLost', 'killingSprees', 'neutralMinionsKilled', 'totalMinionsKilled', 'totalDamageDealtToChampions', 'visionScore', 'visionWardsBoughtInGame', 'wardsPlaced', 'wardsKilled',
	'bountyGold', 'goldPerMinute', 'kda', 'scuttleCrabKills', 'earliestDragonTakedown', 'teamRiftHeraldKills', 'baronTakedowns', 'teamBaronKills', 'epicMonsterKillsWithin30SecondsOfSpawn', 'firstTurretKilledTime', 'kTurretsDestroyedBeforePlatesFall', 'turretPlatesTaken', 'turretsTakenWithRiftHerald', 'lostAnInhibitor', 'controlWardTimeCoverageInRiverOrEnemyHalf', 'controlWardsPlaced',
    'visionScoreAdvantageLaneOpponent', 'visionScorePerMinute', 'wardTakedowns', 'wardTakedownsBefore20M', 'killAfterHiddenWithAlly', 'damagePerMinute', 'effectiveHealAndShielding', 'dodgeSkillShotsSmallWindow', 'enemyChampionImmobilizations', 'immobilizeAndKillWithAlly', 'killParticipation', 'killsNearEnemyTurret', 'killsUnderOwnTurret', 'multikills', 'outnumberedKills', 'soloKills',
    'teamDamagePercentage', 'maxKillDeficit', 'earlyLaningPhaseGoldExpAdvantage', 'laningPhaseGoldExpAdvantage']

tp_lst = []
for r_cols in randomfr_cols_ver2:
    if r_cols in l_columns:
        tp_lst.append(r_cols)

print(tp_lst)
print(len(tp_lst), len(l_columns)) #52, 57

The result: 52 of the 60 were in LightGBM's feature_importance plot. I wondered which 5 of the 57 plot columns were left out, so I checked that too.

for i in l_columns:
    if i not in tp_lst:
        print(i)
#firstBloodAssist, firstBloodKill, firstTowerKill, epicMonsterKillsWithin30SecondsOfSpawn, earlyLaningPhaseGoldExpAdvantage

Sure enough, these are the features that had low importance (below 10 on the importance plot).

But I'm not one to stop(?) here. One more round!

from sklearn.feature_selection import RFECV

select = RFECV(RandomForestClassifier(random_state=231), min_features_to_select=10, cv=5)
# meaning: run RFE with 5-fold CV, keeping at least 10 features!! Changed the random_state this time too!!
select.fit(X_train, y_train)
X_train.columns[select.support_], len(X_train.columns[select.support_]) # ?? got 66..

Hmm.. a bit thrown by that? For now, let's compare the tp_lst built above (52 columns) with this result (66 columns).

col_lst = []

randomfr_cv_cols = X_train.columns[select.support_]
for cols in randomfr_cv_cols:
    if cols in tp_lst:
        col_lst.append(cols)

print(col_lst)
print(len(col_lst)) #52

col_lst.sort(); tp_lst.sort()
col_lst == tp_lst #True!

! Conclusion: use the 52 columns in col_lst! And with that, feature selection is done :) What remains now is the real model design and preprocessing... Let's go!


메타몽이되고싶어
