R | Grayrecord Technow Blog

“採血データを使った心不全予測"のコンペに参加したので、解法を示します。まず、基本的に、2つの分類の予測問題は元々慣れていたので基本的なフレームとしては以下の通りです。探索的データ分析ベースラインモデルの作成予測モデルの作成探索的データ分析まず、最初にデータを一応概観してみました。発売中の技術同人誌 “Pythonによる探索的データ分析クックブック“でも触れている、ydata-profilingを使用しています。 import os import sys import pandas as pd import polars as pl import pyarrow as pa import numpy as np import matplotlib.pyplot as plt import seaborn as sns from ydata_profiling import ProfileReport train_df = pd.read_csv("../data/train.csv") test_df = pd.read_csv("../data/test.csv") profile = ProfileReport(train_df, title="Heart Failure Report") profile.to_file("../profile/heart_failure_report.html") ベースラインモデルの作成基本線となるベースラインモデルを作成しました。ベースラインモデルは文字通りベースラインモデルなので、複雑なものは避けるのがセオリーです。今回は二つの分類をするタイプなので、一般化線形回帰でロジスティック回帰に持ち込むのを基本としました。また、stepwiseなどの容易さから、一旦、Rでモデリングを進めました。コードとしては以下のシンプルきわまるものです。 require(dplyr) require(readr) require(ggplot2) require(pROC) train.df <- read.csv("../data/train.csv") test.df <- read.csv("../data/test.csv") train.df <- train.df %>% mutate( anaemia = as.factor(anaemia), diabetes = as.factor(diabetes), high_blood_pressure = as.factor(high_blood_pressure), sex = as.factor(sex), smoking = as.factor(smoking), target = as.factor(target) ) require(MASS) model <- glm(formula = target ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time, data = train.df, family = binomial) model.opt <- stepAIC(model) pred.df <- predict(model.opt, train.df, type = "response") roc_curve <- roc(train.df$target, pred.df) auc_value <- auc(roc_curve) ar_value <- (auc_value - 0.5) * 2 print(paste("AR値:", ar_value)) cat("Length of Cumulative_Percentage:", length(seq(0, 1, length.out = nrow(train.df))), "\n") cat("Length of Predicted_Positive:", length(sort(pred.df, decreasing = TRUE)), "\n") cat("Length of Actual_Positive:", length(sort(as.numeric(train.df$target), decreasing = TRUE)), "\n") sorted_predictions <- sort(pred.df, decreasing = TRUE) # 予測確率を降順に sorted_targets <- sort(as.numeric(as.character(train.df$target)), decreasing = TRUE) # 実ターゲットを降順に n <- min(length(sorted_predictions), length(sorted_targets)) cumulative_data <- data.frame( Cumulative_Percentage = seq(0, 1, length.out = n), Predicted_Positive = cumsum(sorted_predictions[1:n]), Actual_Positive = cumsum(sorted_targets[1:n]) ) ggplot(cumulative_data, aes(x = Cumulative_Percentage)) + geom_line(aes(y = Predicted_Positive, color = "モデル予測")) + geom_line(aes(y = Actual_Positive, color = "実データ")) + scale_y_continuous(name = "累積正例率", limits = c(0, max(cumulative_data$Predicted_Positive, cumulative_data$Actual_Positive))) + scale_x_continuous(name = "累積パーセンテージ", limits = c(0, 1)) + labs(title = "CAP図") + theme_minimal() 最終的には以下のCAP図になりました、 ...