## Random Weighted Classifier in R


### Problem

I am computing a random weighted classifier based on the rates at which 3 labels appear in a "train" set. I want to use this RWC as a baseline for other classifiers. I'm doing this over 1000 iterations and then computing the mean F1, precision and recall of each class, in addition to the overall kappa.

Can this code run faster/look nicer? Minimal example here:

```r
library(caret)

random_weighted_classifier <- function(weightA, weightB, weightC){
  random_number = sample(1:100, 1) / 100
  if(random_number <= weightA){
    return("better")
  }else if(random_number > weightA && random_number <= (weightA + weightB)){
    return("worse")
  }else if(random_number > (weightA + weightB) && random_number <= (weightA + weightB + weightC)){
    return("no change")
  }
}

test <- function(){
  betters = rep(x = "better", 100)
  worses = rep(x = "worse", 50)
  no_changes = rep(x = "no_change", 10)
  reference = sample(c(betters, worses, no_changes))

  better = sum(reference == "better")
  worse = sum(reference == "worse")
  no_change = sum(reference == "no_change")
  total = length(reference)

  # rwc = random weighted classifier
  prediction_rwc = vector("character", total)

  iterations = 1000
  f1_rwc = matrix(0., iterations, 3)
  pres_rwc = matrix(0., iterations, 3)
  rec_rwc = matrix(0., iterations, 3)
  kappa_rwc = vector("double", iterations)

  for(i in seq(1:iterations)){
    for(j in seq(1:total)){
      prediction_rwc[[j]] = random_weighted_classifier(better/total, worse/total, no_change/total)
    }
    cm = confusionMatrix(data = factor(prediction_rwc, levels = c("better", "worse", "no_change")),
                         reference = factor(reference, levels = c("better", "worse", "no_change")),
                         positive = c("better", "worse"),
                         mode = "everything")

    f1_rwc[i, 1:3] <- cm$byClass[, "F1"]
    pres_rwc[i, 1:3] = cm$byClass[, "Precision"]
    rec_rwc[i, 1:3] = cm$byClass[, "Recall"]
    kappa_rwc[[i]] = round(cm$overall["Kappa"], 2)
  }

  print(list("f1" = c(mean(f1_rwc[,1], na.rm = T), mean(f1_rwc[,2], na.rm = T), mean(f1_rwc[,3], na.rm = T)),
             "precision" = c(mean(pres_rwc[,1], na.rm = T), mean(pres_rwc[,2], na.rm = T), mean(pres_rwc[,3], na.rm = T)),
             "recall" = c(mean(rec_rwc[,1], na.rm = T), mean(rec_rwc[,2], na.rm = T), mean(rec_rwc[,3], na.rm = T)),
             "kappa" = mean(kappa_rwc, na.rm = T)))
}

test()
```

## Answers


Best answer


Some improvements:

```r
random_weighted_classifier2 <- function(n = 1, weightA, weightB, weightC){
  x <- sample(1:100, n, replace = TRUE) / 100
  i1 <- x <= weightA
  i2 <- x > weightA & x <= (weightA + weightB)
  rez <- rep("no_change", n)
  rez[i2] <- "worse"
  rez[i1] <- "better"
  rez
}

test <- function(){
  betters <- rep("better", 100)
  worses <- rep("worse", 50)
  no_changes <- rep("no_change", 10)
  reference <- sample(c(betters, worses, no_changes))

  better <- sum(reference == "better")
  worse <- sum(reference == "worse")
  no_change <- sum(reference == "no_change")
  total <- length(reference)

  iterations <- 1000
  f1_rwc <- pres_rwc <- rec_rwc <- matrix(0., iterations, 3)
  kappa_rwc <- vector("double", iterations)

  referenceF <- factor(reference, levels = c("better", "worse", "no_change"))
  for (i in seq(1:iterations)) {
    prediction_rwc <- random_weighted_classifier2(total,
                                                  better/total,
                                                  worse/total,
                                                  no_change/total)
    prediction_rwc <-
      factor(prediction_rwc, levels = c("better", "worse", "no_change"))
    conTable <- table(prediction_rwc, referenceF)
    cm <- confusionMatrix(conTable, positive = c("better", "worse"),
                          mode = "everything")

    f1_rwc[i, 1:3] <- cm$byClass[, "F1"]
    pres_rwc[i, 1:3] <- cm$byClass[, "Precision"]
    rec_rwc[i, 1:3] <- cm$byClass[, "Recall"]
    kappa_rwc[[i]] <- round(cm$overall["Kappa"], 2)
  }

  print(list("f1" = colMeans(f1_rwc, na.rm = TRUE),
             "precision" = colMeans(pres_rwc, na.rm = TRUE),
             "recall" = colMeans(rec_rwc, na.rm = TRUE),
             "kappa" = mean(kappa_rwc, na.rm = TRUE)))
}
```

It should be around 50% faster (16.80 s vs. 11.02 s for 1k iterations).

You could possibly speed it up even more by dropping the `confusionMatrix` call and computing the necessary statistics yourself.
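One further simplification worth noting (my sketch, not part of the timing comparison above): base R's `sample()` already supports weighted draws through its `prob` argument, so the hand-rolled classifier collapses to a one-liner. Note this samples with exact probabilities rather than on the 1:100 grid used above.

```r
# Sketch of a drop-in replacement for random_weighted_classifier2:
# draw n labels at once, weighted by the training-set class rates.
random_weighted_classifier3 <- function(n, weightA, weightB, weightC) {
  sample(c("better", "worse", "no_change"), n, replace = TRUE,
         prob = c(weightA, weightB, weightC))
}
```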

## Update

Based on the source of `confusionMatrix`, I managed to extract the relevant parts of the code and wrap them in a function:

```r
statistics <- function(data, beta = 1) {
  stat <- sapply(rownames(data), function(relevant) {
    if (nrow(data) > 2) {
      m <- matrix(NA, 2, 2)
      colnames(m) <- rownames(m) <- c("rel", "irrel")
      irrelCol <- which(!(colnames(data) %in% relevant))
      relCol <- which(colnames(data) %in% relevant)
      m[1, 1] <- sum(data[relCol, relCol])
      m[1, 2] <- sum(data[relCol, irrelCol])
      m[2, 1] <- sum(data[irrelCol, relCol])
      m[2, 2] <- sum(data[irrelCol, irrelCol])
      m <- as.table(m)
      relevant <- "rel"
    }
    numer <- m[relevant, relevant]
    denom <- sum(m[relevant, ])
    prec <- ifelse(denom > 0, numer/denom, NA)  # Precision

    denom <- sum(m[, relevant])
    rec <- ifelse(denom > 0, numer/denom, NA)   # Recall

    F1 <- (1 + beta^2) * prec * rec / ((beta^2 * prec) + rec)  # F1
    c("Precision" = prec, "Recall" = rec, "F1" = F1)
  })

  k <- unlist(e1071::classAgreement(data))["kappa"]
  list(stat, kappa = k)
}
```

and then `test3` looks like:

```r
test3 <- function(iterations = 100){
  vals <- c("better", "worse", "no_change")
  betters <- rep("better", 100)
  worses <- rep("worse", 50)
  no_changes <- rep("no_change", 10)
  reference <- sample(c(betters, worses, no_changes))

  better <- sum(reference == "better")
  worse <- sum(reference == "worse")
  no_change <- sum(reference == "no_change")
  n <- length(reference)

  f1_rwc <- pres_rwc <- rec_rwc <- matrix(0., iterations, 3)
  kappa_rwc <- vector("double", iterations)
  referenceF <- factor(reference, levels = vals)

  for (i in seq(1:iterations)) {
    prediction_rwc <-
      random_weighted_classifier2(n, better/n, worse/n, no_change/n)
    prediction_rwc <- factor(prediction_rwc, levels = vals)
    conTable <- table(prediction_rwc, referenceF)
    cm2 <- statistics(conTable)
    f1_rwc[i, 1:3] <- cm2[[1]][3, ]
    pres_rwc[i, 1:3] <- cm2[[1]][1, ]
    rec_rwc[i, 1:3] <- cm2[[1]][2, ]
    kappa_rwc[[i]] <- round(cm2[[2]], 2)
  }

  list("f1" = colMeans(f1_rwc, na.rm = TRUE),
       "precision" = colMeans(pres_rwc, na.rm = TRUE),
       "recall" = colMeans(rec_rwc, na.rm = TRUE),
       "kappa" = mean(kappa_rwc, na.rm = TRUE))
}
```

This should run in under 1 s for 1k iterations.
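For reference, the timing claims here can be checked with base R's `system.time()`; actual numbers will of course vary by machine.

```r
# Rough timing of 1000 iterations of the final version.
set.seed(1)                        # make the random draws reproducible
t3 <- system.time(test3(iterations = 1000))
print(t3["elapsed"])               # elapsed seconds for 1k iterations
```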

P.S. Kappa can be calculated with this:

```r
n  <- sum(data)
ni <- rowSums(data)
nj <- colSums(data)
p0 <- sum(diag(data, names = FALSE)) / n
pc <- sum((ni/n) * (nj/n))
k  <- (p0 - pc) / (1 - pc)
```
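As a quick sanity check of those formulas (my own toy table, not taken from the code above): a 2x2 table with 70 agreements out of 100 and balanced column marginals gives kappa = 0.4.

```r
# Toy 2x2 contingency table (rows = prediction, cols = reference)
data <- matrix(c(40, 10, 20, 30), 2, 2)  # diagonal: 40 + 30 agreements
n  <- sum(data)                  # 100
ni <- rowSums(data)              # row marginals: c(60, 40)
nj <- colSums(data)              # column marginals: c(50, 50)
p0 <- sum(diag(data)) / n        # observed agreement: 0.7
pc <- sum((ni/n) * (nj/n))       # chance agreement: 0.5
k  <- (p0 - pc) / (1 - pc)       # Cohen's kappa: 0.4
```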

P.P.S. When reducing to those formulas, I stripped the code that was used for testing, so if you change your data format, bugs may appear. I assumed the setting doesn't change.
