model.train() and model.eval() cause NaN values
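Hey, I'm trying to do image classification / transfer learning on the monkey species dataset with resnet50, using a modified final fc layer to predict 10 classes. Everything was working until I started using model.train() and model.eval(). Now, after the first epoch, it starts returning NaNs and the accuracy drops, as shown below. I'm curious why this only happens when switching between train and eval...?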
First I import the model, attach the classifier, and freeze the parameters:
    %%capture
    resnet = models.resnet50(pretrained=True)
    for param in resnet.parameters():
        param.required_grad = False
    in_features = resnet.fc.in_features
    # Build custom classifier
    classifier = nn.Sequential(OrderedDict([
        ('fc1', nn.Linear(in_features, 512)),
        ('relu', nn.ReLU()),
        ('drop', nn.Dropout(0.05)),
        ('fc2', nn.Linear(512, 10)),
    ]))
    # ('output', nn.LogSoftmax(dim=1))
    resnet.classifier = classifier
    resnet.to(device)
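
A side note on the snippet above, offered as a sketch rather than a diagnosis. The attribute autograd checks is requires_grad, so assigning to param.required_grad only creates an unused attribute and leaves every parameter trainable. Also, torchvision's ResNet calls self.fc in its forward pass, so a module stored under resnet.classifier is registered but never used. The example below assumes the intent is to freeze the backbone and train only a new head. TinyBackbone is a hypothetical stand-in for resnet50 so the code runs without torchvision or downloaded weights.

```python
import torch
from torch import nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained network whose head is named `fc`."""

    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 16)
        self.fc = nn.Linear(16, 1000)

    def forward(self, x):
        return self.fc(torch.relu(self.features(x)))


model = TinyBackbone()

# Freeze every existing parameter; note the spelling `requires_grad`.
for param in model.parameters():
    param.requires_grad = False

# Replace the layer that forward() actually calls.
# Freshly created layers default to requires_grad=True.
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 512),
    nn.ReLU(),
    nn.Dropout(0.05),
    nn.Linear(512, 10),
)

# Only the new head should be trainable, so only it goes to the optimizer.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)

out = model(torch.randn(4, 8))
print(trainable)
print(out.shape)
```

With a real resnet50 the same pattern would apply to resnet.fc, assuming the torchvision model is used unchanged.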
Then I set up my loss function, optimizer, and scheduler:
    # Step: define criterion and optimizer
    criterion = nn.CrossEntropyLoss()
    # pass the optimizer to the appended classifier layer
    optimizer = torch.optim.SGD(resnet.parameters(), lr=0.01)
    # scheduler
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.05)
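
To make the schedule concrete: with milestones=[10] and gamma=0.05, the learning rate should stay at 0.01 for the first ten epochs and then drop to 0.01 * 0.05 = 0.0005. The sketch below checks that arithmetic on a single dummy parameter (a placeholder, not the real model).

```python
import torch
from torch import nn

# One dummy parameter is enough to drive the optimizer and scheduler.
params = [nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10], gamma=0.05
)

lrs = []
for epoch in range(12):
    # Learning rate in effect during this epoch.
    lrs.append(optimizer.param_groups[0]["lr"])
    # No gradients exist here; step() is called only to keep the usual
    # optimizer.step() -> scheduler.step() order.
    optimizer.step()
    scheduler.step()

print(lrs)
```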
Then I set up the training and validation loops:
    epochs = 20
    tr_losses = []
    avg_epoch_tr_loss = []
    tr_accuracy = []
    val_losses = []
    avg_epoch_val_loss = []
    val_accuracy = []
    val_loss_min = np.inf

    resnet.train()
    for epoch in range(epochs):
        for i, batch in enumerate(train_loader):
            # Pull the data and labels from the batch
            data, label = batch
            # If available push data and label to GPU
            if train_on_gpu:
                data, label = data.to(device), label.to(device)
            # Compute the logit
            logit = resnet(data)
            # Compute loss
            loss = criterion(logit, label)
            # Clearing the gradient
            resnet.zero_grad()
            # Backpropagate the gradients (accumulate the partial derivatives of loss)
            loss.backward()
            # Apply the updates to the optimizer step in the opposite direction to the gradient
            optimizer.step()
            # Store the losses of each batch
            # loss.item() separates the loss from comp graph
            tr_losses.append(loss.item())
            # Detach and store the average accuracy of each batch
            tr_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean())
            # Print the rolling batch training loss every 40 batches
            if i % 40 == 0 and not i == 1:
                print(f'Batch No: {i} \tAverage Training Batch Loss: {torch.tensor(tr_losses).mean():.2f}')
        # Print the average loss for each epoch
        print(f'\nEpoch No: {epoch + 1},Training Loss: {torch.tensor(tr_losses).mean():.2f}')
        # Print the average accuracy for each epoch
        print(f'Epoch No: {epoch + 1},Training Accuracy: {torch.tensor(tr_accuracy).mean():.2f}\n')
        # Store the avg epoch loss for plotting
        avg_epoch_tr_loss.append(torch.tensor(tr_losses).mean())

        resnet.eval()
        for i, batch in enumerate(val_loader):
            # Pull the data and labels from the batch
            data, label = batch
            # If available push data and label to GPU
            if train_on_gpu:
                data, label = data.to(device), label.to(device)
            # Compute the logits without computing the gradients
            with torch.no_grad():
                logit = resnet(data)
                # Compute loss
                loss = criterion(logit, label)
            # Store test loss
            val_losses.append(loss.item())
            # Store the accuracy for each batch
            val_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean())
            if i % 20 == 0 and not i == 1:
                print(f'Batch No: {i+1} \tAverage Val Batch Loss: {torch.tensor(val_losses).mean():.2f}')
        # Print the average loss for each epoch
        print(f'\nEpoch No: {epoch + 1},Epoch Val Loss: {torch.tensor(val_losses).mean():.2f}')
        # Print the average accuracy for each epoch
        print(f'Epoch No: {epoch + 1},Epoch Val Accuracy: {torch.tensor(val_accuracy).mean():.2f}\n')
        # Store the avg epoch loss for plotting
        avg_epoch_val_loss.append(torch.tensor(val_losses).mean())
        # Checkpointing the model using val loss threshold
        if torch.tensor(val_losses).float().mean() <= val_loss_min:
            print("Epoch Val Loss Decreased... Saving model")
            # save current model
            torch.save(resnet.state_dict(), '/content/drive/MyDrive/1. Full Projects/Intel Image Classification/model_state.pt')
            val_loss_min = torch.tensor(val_losses).mean()
        # Step the scheduler for the next epoch
        scheduler.step()
        # Print the updated learning rate
        print('Learning Rate Set To: {:.5f}'.format(optimizer.state_dict()['param_groups'][0]['lr']), '\n')
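
One thing worth checking in a loop like this is where the mode switches sit. As posted, resnet.train() appears once before the epoch loop while resnet.eval() is called inside it. If that placement is accurate (the page flattened the indentation, so this is an assumption), every epoch after the first would train with dropout disabled and batch norm using its running statistics. The sketch below uses a small hypothetical model and fake data to show the usual pattern of switching modes at the start of each phase.

```python
import torch
from torch import nn

# Hypothetical toy model containing the two layer types that behave
# differently in train and eval mode.
model = nn.Sequential(
    nn.Linear(4, 8), nn.BatchNorm1d(8), nn.Dropout(0.5), nn.Linear(8, 2)
)
batch = torch.randn(6, 4)  # fake data, for illustration only

modes = []  # records (phase, model.training) so the switches are visible
for epoch in range(2):
    model.train()  # switch back to training mode at the start of every epoch
    modes.append(("train", model.training))
    _ = model(batch)  # training batches would go here

    model.eval()  # switch to eval mode for validation
    modes.append(("val", model.training))
    with torch.no_grad():
        _ = model(batch)  # validation batches would go here

print(modes)
```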
The model starts training and then the loss turns into NaN values:
    Batch No: 0     Average Training Batch Loss: 9.51
    Batch No: 40    Average Training Batch Loss: 1.71
    Batch No: 80    Average Training Batch Loss: 1.15
    Batch No: 120   Average Training Batch Loss: 0.94

    Epoch No: 1,Training Loss: 0.83
    Epoch No: 1,Training Accuracy: 0.78

    Batch No: 1     Average Val Batch Loss: 0.39
    Batch No: 21    Average Val Batch Loss: 0.56
    Batch No: 41    Average Val Batch Loss: 0.54
    Batch No: 61    Average Val Batch Loss: 0.54

    Epoch No: 1,Epoch Val Loss: 0.55
    Epoch No: 1,Epoch Val Accuracy: 0.81

    Epoch Val Loss Decreased... Saving model
    Learning Rate Set To: 0.01000

    Batch No: 0     Average Training Batch Loss: 0.83
    Batch No: 40    Average Training Batch Loss: nan
    Batch No: 80    Average Training Batch Loss: nan
I see that resnet.zero_grad() comes after logit = resnet(data), which is causing the gradients to explode in your case.
Please do it as follows:
    # Clearing the gradient
    optimizer.zero_grad()
    logit = resnet(data)
    # Compute loss
    loss = criterion(logit, label)
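
For context, here is a sketch of one complete training step using the ordering suggested above: clear gradients, forward pass, loss, backward pass, optimizer step. The model and batch are hypothetical placeholders. The torch.isfinite check is an extra guard that is not part of the answer. It simply stops at the first NaN or inf loss so the failing step is easy to locate.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 3)  # hypothetical stand-in for the real network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(16, 8)  # fake batch, for illustration only
label = torch.randint(0, 3, (16,))

losses = []
model.train()
for step in range(5):
    optimizer.zero_grad()  # clear gradients left over from the previous step
    logit = model(data)  # forward pass
    loss = criterion(logit, label)
    if not torch.isfinite(loss):  # added guard: fail fast on NaN/inf
        raise RuntimeError(f"non-finite loss at step {step}")
    loss.backward()  # compute gradients
    optimizer.step()  # apply the update
    losses.append(loss.item())

print(losses)
```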