
How do I fix model.train() and model.eval() causing NaN values?


Hey, I'm trying to do image classification / transfer learning with the monkey-species dataset and a ResNet50 whose final fc layer has been modified to predict 10 classes. Everything works until I use model.train() and model.eval(); then, after the first epoch, it starts returning NaNs and the accuracy drops, as shown below. I'm curious why this only happens when switching to train/eval...?
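For context: model.train() and model.eval() only toggle the behavior of layers such as Dropout and BatchNorm between training and inference mode; they do not switch gradient tracking on or off. A minimal illustration:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

drop.train()     # training mode: elements are randomly zeroed,
print(drop(x))   # survivors are scaled by 1/(1-p)

drop.eval()      # eval mode: dropout is a no-op
print(drop(x))   # identical to the input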

First I import the model, attach the classifier, and freeze the parameters:

%%capture
from collections import OrderedDict

import numpy as np
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(pretrained=True)

# Freeze the backbone parameters (note: the attribute is requires_grad)
for param in resnet.parameters():
  param.requires_grad = False

in_features = resnet.fc.in_features


# Build custom classifier
classifier = nn.Sequential(OrderedDict([
  ('fc1', nn.Linear(in_features, 512)),
  ('relu', nn.ReLU()),
  ('drop', nn.Dropout(0.05)),
  ('fc2', nn.Linear(512, 10)),
]))

# ('output', nn.LogSoftmax(dim=1))
resnet.fc = classifier  # ResNet's head attribute is fc

resnet.to(device)
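A quick way to sanity-check the freeze is to count the parameters that still require gradients; only the new head should be trainable:

# Sanity check: with the backbone frozen, only the new fc head
# should report requires_grad=True.
trainable = sum(p.numel() for p in resnet.parameters() if p.requires_grad)
total = sum(p.numel() for p in resnet.parameters())
print(f'Trainable parameters: {trainable:,} of {total:,}')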

Then I set up my loss function, optimizer, and scheduler:

# Step: define criterion and optimizer
criterion = nn.CrossEntropyLoss()
# pass the optimizer to the appended classifier layer
optimizer = torch.optim.SGD(resnet.parameters(), lr=0.01)
# scheduler
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.05)
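One note: the comment above says the optimizer is passed the appended classifier layer, but resnet.parameters() actually hands it every parameter in the network. A sketch of what the comment appears to intend, assuming only the new head should be updated:

# Assumed intent (an assumption, not from the original post): optimize only
# the new head. With the backbone frozen this trains the same weights either
# way, but it keeps the optimizer's parameter list explicit and small.
optimizer = torch.optim.SGD(resnet.fc.parameters(), lr=0.01)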

Then I set up the training and validation loop:

epochs = 20


tr_losses = []
avg_epoch_tr_loss = []
tr_accuracy = []


val_losses = []
avg_epoch_val_loss = []
val_accuracy = []
val_loss_min = np.inf


resnet.train()
for epoch in range(epochs):
  for i, batch in enumerate(train_loader):
    # Pull the data and labels from the batch
    data, label = batch
    # If available push data and label to GPU
    if train_on_gpu:
      data, label = data.to(device), label.to(device)
    # Compute the logits
    logit = resnet(data)
    # Compute loss
    loss = criterion(logit, label)
    # Clearing the gradient
    resnet.zero_grad()
    # Backpropagate the gradients (accumulate the partial derivatives of loss)
    loss.backward()
    # Apply the updates; the optimizer steps in the opposite direction to the gradient
    optimizer.step()
    # Store the losses of each batch
    # loss.item() separates the loss from the comp graph
    tr_losses.append(loss.item())
    # Detach and store the average accuracy of each batch
    tr_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean())
    # Print the rolling batch training loss every 40 batches
    if i % 40 == 0 and not i == 1:
      print(f'Batch No: {i} \tAverage Training Batch Loss: {torch.tensor(tr_losses).mean():.2f}')
  # Print the average loss for each epoch
  print(f'\nEpoch No: {epoch + 1}, Training Loss: {torch.tensor(tr_losses).mean():.2f}')
  # Print the average accuracy for each epoch
  print(f'Epoch No: {epoch + 1}, Training Accuracy: {torch.tensor(tr_accuracy).mean():.2f}\n')
  # Store the avg epoch loss for plotting
  avg_epoch_tr_loss.append(torch.tensor(tr_losses).mean())


  resnet.eval()
  for i, batch in enumerate(val_loader):
    # Pull the data and labels from the batch
    data, label = batch
    # If available push data and label to GPU
    if train_on_gpu:
      data, label = data.to(device), label.to(device)
    # Compute the logits without computing the gradients
    with torch.no_grad():
      logit = resnet(data)
    # Compute loss
    loss = criterion(logit, label)
    # Store test loss
    val_losses.append(loss.item())
    # Store the accuracy for each batch
    val_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean())
    if i % 20 == 0 and not i == 1:
      print(f'Batch No: {i+1} \tAverage Val Batch Loss: {torch.tensor(val_losses).mean():.2f}')
  # Print the average loss for each epoch
  print(f'\nEpoch No: {epoch + 1}, Epoch Val Loss: {torch.tensor(val_losses).mean():.2f}')
  # Print the average accuracy for each epoch
  print(f'Epoch No: {epoch + 1}, Epoch Val Accuracy: {torch.tensor(val_accuracy).mean():.2f}\n')
  # Store the avg epoch loss for plotting
  avg_epoch_val_loss.append(torch.tensor(val_losses).mean())

  # Checkpoint the model using a val loss threshold
  if torch.tensor(val_losses).float().mean() <= val_loss_min:
    print("Epoch Val Loss Decreased... Saving model")
    # save current model
    torch.save(resnet.state_dict(), '/content/drive/MyDrive/1. Full Projects/Intel Image Classification/model_state.pt')
    val_loss_min = torch.tensor(val_losses).mean()
  # Step the scheduler for the next epoch
  scheduler.step()
  # Print the updated learning rate
  print('Learning Rate Set To: {:.5f}'.format(optimizer.state_dict()['param_groups'][0]['lr']), '\n')

The model starts training, then the values slowly turn into NaN:

Batch No: 0     Average Training Batch Loss: 9.51
Batch No: 40    Average Training Batch Loss: 1.71
Batch No: 80    Average Training Batch Loss: 1.15
Batch No: 120   Average Training Batch Loss: 0.94

Epoch No: 1,Training Loss: 0.83
Epoch No: 1,Training Accuracy: 0.78

Batch No: 1     Average Val Batch Loss: 0.39
Batch No: 21    Average Val Batch Loss: 0.56
Batch No: 41    Average Val Batch Loss: 0.54
Batch No: 61    Average Val Batch Loss: 0.54

Epoch No: 1,Epoch Val Loss: 0.55
Epoch No: 1,Epoch Val Accuracy: 0.81

Epoch Val Loss Decreased... Saving model
Learning Rate Set To: 0.01000 

Batch No: 0     Average Training Batch Loss: 0.83
Batch No: 40    Average Training Batch Loss: nan
Batch No: 80    Average Training Batch Loss: nan

Solution

I see that resnet.zero_grad() comes after logit = resnet(data), which in your case causes the gradients to explode.

It should instead be ordered like this:

# Clearing the gradient
optimizer.zero_grad()
logit = resnet(data)

# Compute loss
loss = criterion(logit, label)
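Putting the fix in context, a minimal sketch of one epoch with the conventional ordering (clear gradients, forward, loss, backward, step) and with the train/eval toggles applied once per epoch; variable names follow the question's code:

for epoch in range(epochs):
  resnet.train()                   # training mode (Dropout/BatchNorm active)
  for data, label in train_loader:
    data, label = data.to(device), label.to(device)
    optimizer.zero_grad()          # clear stale gradients before the forward pass
    logit = resnet(data)           # forward pass
    loss = criterion(logit, label)
    loss.backward()                # backpropagate
    optimizer.step()               # update the weights

  resnet.eval()                    # inference mode for validation
  with torch.no_grad():
    for data, label in val_loader:
      data, label = data.to(device), label.to(device)
      logit = resnet(data)
      val_loss = criterion(logit, label)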
