Alternatives to loops in R
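I am looping over an ecological database with more than 600000 rows. One of the columns is "species", but it does not contain only species names: sometimes, when the species is unknown, people write the name of the class, genus, order, and so on.
I need to sort each piece of information into the correct column according to its taxonomic level, so I wrote a loop that assigns the taxonomic values for every row, taking them from other rows of the same data frame where this has already been done correctly. Simplified, the data look like this data frame: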
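data = data.frame(
  LaTin_name = c("Gadus morhua", "Gadidae", "Thalasoma pavo", "Engraulis encrasicolus", "Gadiformes", "Actinopterygii"),
  Family = c("Gadidae", "NA", "Labridae", "Engraulidae", "NA", "NA"),
  Order = c("Gadiformes", "NA", "Labriformes", "Clupeiformes", "NA", "NA"),
  Class = c("Actinopterygii", "NA", "Actinopterygii", "Actinopterygii", "NA", "NA"),
  Other_cols = c("x", "NA", "x", "x", "NA", "NA")  # stands in for the remaining columns; values taken from the printed output
)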
The loop I use to fill in the columns that contain NA is this:
for (row in 1:nrow(data)) {
  if ((data$LaTin_name[row] %in% data$Family) == TRUE) {
    data$Family[row] = data$Family[which(data$Family == data$LaTin_name[row])][1]
    data$Order[row] = data$Order[which(data$Family == data$LaTin_name[row])][1]
    data$Class[row] = data$Class[which(data$Family == data$LaTin_name[row])][1]
  } else if ((data$LaTin_name[row] %in% data$Class) == TRUE) {
    data$Family[row] = data$Family[which(data$Class == data$LaTin_name[row])][1]
    data$Order[row] = data$Order[which(data$Class == data$LaTin_name[row])][1]
    data$Class[row] = data$Class[which(data$Class == data$LaTin_name[row])][1]
  } else if ((data$LaTin_name[row] %in% data$Order) == TRUE) {
    data$Family[row] = data$Family[which(data$Order == data$LaTin_name[row])][1]
    data$Order[row] = data$Order[which(data$Order == data$LaTin_name[row])][1]
    data$Class[row] = data$Class[which(data$Order == data$LaTin_name[row])][1]
  }
}
Can any of you think of a faster way to do this? This loop takes more than 10 hours to run :'(
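You should avoid explicit row-by-row loops in R. For what you are trying to do, dplyr / tidyverse is the more commonly recommended route. I think the approach would look something like this (corrections welcome, as I have not used R for a while).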
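library(dplyr)
# Template only: replace the *...* placeholders with real column names and expressions
parsedData <- data %>% group_by(*variable name you would like to group*) %>% summarise(
  newVariable = *the components of your new variable*
)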
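The most important thing, though, is to avoid the loop. If you do need to iterate over something large, you can also look at the sapply, apply, lapply and mapply functions.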
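As a rough illustration of what dropping the loop can look like in base R, the sketch below replaces the per-row `which()` searches with a few vectorised `match()` calls. It is only a sketch under assumptions: it uses the toy data from the question (with the literal string "NA" for unknowns), it keeps the loop's priority of Family, then Class, then Order, and the names `src`, `hit`, `cols` and `filled` are made up for this example. It looks values up in the unmodified data, so it may differ from the loop in edge cases where the loop's earlier writes change later lookups.

```r
# Toy data in the shape used in the question; "NA" is a string here.
data <- data.frame(
  LaTin_name = c("Gadus morhua", "Gadidae", "Thalasoma pavo",
                 "Engraulis encrasicolus", "Gadiformes", "Actinopterygii"),
  Family = c("Gadidae", "NA", "Labridae", "Engraulidae", "NA", "NA"),
  Order  = c("Gadiformes", "NA", "Labriformes", "Clupeiformes", "NA", "NA"),
  Class  = c("Actinopterygii", "NA", "Actinopterygii", "Actinopterygii", "NA", "NA"),
  stringsAsFactors = FALSE
)

# For every row at once, find the first row whose Family equals LaTin_name.
src <- match(data$LaTin_name, data$Family)
# Where that failed, fall back to Class, then to Order (same order as the loop).
src[is.na(src)] <- match(data$LaTin_name, data$Class)[is.na(src)]
src[is.na(src)] <- match(data$LaTin_name, data$Order)[is.na(src)]

# Copy the taxonomy columns from the matched rows in a single assignment.
hit  <- !is.na(src)
cols <- c("Family", "Order", "Class")
filled <- data
filled[hit, cols] <- data[src[hit], cols]

print(filled)
```

Because `match()` scans each column once for all rows, the work no longer grows with one full column search per row, which is where most of the loop's time is likely to go.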
library(plyr)  # provides revalue()
data2 = data
data2[data2 == "NA"] = NA
rownames(data2) = 1:nrow(data2)
# rows with at least four known values serve as the reference
cleanData = data2[rowSums(!is.na(data2)) >= 4,]
FamilyKey = setNames(cleanData[,2], cleanData[,1])
OrderKey = setNames(cleanData[,3], cleanData[,2])
ClassKey = setNames(cleanData[,4], cleanData[,3])
data3 = data2[rowSums(!is.na(data2)) < 4,]
# step 1: label to the right: label Family, Order and Class from LaTin_name where missing
data3[data3$LaTin_name %in% (FamilyKey),]$Family = revalue(data3[data3$LaTin_name %in% (FamilyKey),]$LaTin_name, FamilyKey)
data3[data3$LaTin_name %in% (OrderKey),]$Order = revalue(data3[data3$LaTin_name %in% (OrderKey),]$LaTin_name, OrderKey)
data3[data3$LaTin_name %in% (ClassKey),]$Class = revalue(data3[data3$LaTin_name %in% (ClassKey),]$LaTin_name, ClassKey)
# step 2: label to the right again: label Order from Family and Class from Order
data3[data3$Family %in% (FamilyKey),]$Order = revalue(data3[data3$Family %in% (FamilyKey),]$Family, OrderKey)
data3[data3$Order %in% (OrderKey),]$Class = revalue(data3[data3$Order %in% (OrderKey),]$Order, ClassKey)
# combine together and reorder to the original order
data2 = rbind(cleanData, data3)
# sort row names numerically; a plain character sort would put "10" before "2"
data2 = data2[order(as.numeric(rownames(data2))), , drop = FALSE]
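The keys built with `setNames()` above are named vectors used as lookup tables. For readers without plyr, the same idea can be sketched in base R by indexing a named vector by name. This is an illustrative variant, not the code from the answer: the reference table `ref`, the vectors `order_of_family` and `class_of_order`, and the made-up family "Unknownidae" are all assumptions for the example.

```r
# Reference rows with complete taxonomy (toy values from the question).
ref <- data.frame(
  Family = c("Gadidae", "Labridae", "Engraulidae"),
  Order  = c("Gadiformes", "Labriformes", "Clupeiformes"),
  Class  = c("Actinopterygii", "Actinopterygii", "Actinopterygii"),
  stringsAsFactors = FALSE
)

# Named vectors as lookup tables: names are the lower level, values the higher.
order_of_family <- setNames(ref$Order, ref$Family)
class_of_order  <- setNames(ref$Class, ref$Order)

# Indexing by name is vectorised; missing or unknown names come back as NA.
fam <- c("Gadidae", NA, "Labridae", "Unknownidae")
ord <- unname(order_of_family[fam])
cls <- unname(class_of_order[ord])

print(data.frame(Family = fam, Order = ord, Class = cls))
```

When a name occurs more than once in the lookup vector, indexing by name returns the first match, so duplicated families in the reference rows are harmless as long as they map to the same order.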
Your output:
> data
              LaTin_name      Family        Order          Class Other_cols
1           Gadus morhua     Gadidae   Gadiformes Actinopterygii          x
2                Gadidae     Gadidae   Gadiformes Actinopterygii         NA
3         Thalasoma pavo    Labridae  Labriformes Actinopterygii          x
4 Engraulis encrasicolus Engraulidae Clupeiformes Actinopterygii          x
5             Gadiformes     Gadidae   Gadiformes Actinopterygii         NA
6         Actinopterygii     Gadidae   Gadiformes Actinopterygii         NA
My output:
> data2
              LaTin_name      Family        Order          Class Other_cols
1           Gadus morhua     Gadidae   Gadiformes Actinopterygii          x
2                Gadidae     Gadidae   Gadiformes Actinopterygii       <NA>
3         Thalasoma pavo    Labridae  Labriformes Actinopterygii          x
4 Engraulis encrasicolus Engraulidae Clupeiformes Actinopterygii          x
5             Gadiformes        <NA>   Gadiformes Actinopterygii       <NA>
6         Actinopterygii        <NA>         <NA> Actinopterygii       <NA>
I am strictly assuming Family ...