Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


linux kernel - TensorFlow Object Detection training killed: resource starvation?

This question has been partially asked here and here with no follow-ups, so maybe this is not the right venue, but I've figured out a little more information that I'm hoping might get these questions answered.

I've been attempting to train object_detection on my own library of roughly 1k photos, using the provided pipeline config file "ssd_inception_v2_pets.config". I believe I've set up the training data properly: when the program couldn't read the data it alerted with an error, I fixed that, and training now appears to start just fine.

My train_config settings are as follows, though I've changed a few of the numbers to try to get it to run with fewer resources.

train_config: {
  batch_size: 1000 #also tried 1, 10, and 100
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.04  # also tried .004
          decay_steps: 800 # also tried 800720 and 80072
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "~/Downloads/ssd_inception_v2_coco_11_06_2017/model.ckpt" #using inception checkpoint
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

Basically, what I think is happening is that the computer is getting resource-starved very quickly. Does anyone know of an optimization that takes more time to train but uses fewer resources?
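One thing I'm considering trying is shrinking the input queues so fewer images sit buffered in RAM at once. This is an assumption on my part — I believe these fields exist in the Object Detection API's train.proto, but I haven't verified them against my version:

```
train_config: {
  batch_size: 1
  # Smaller queues mean fewer decoded images held in RAM at once,
  # at the cost of a slower input pipeline.
  batch_queue_capacity: 2
  prefetch_queue_capacity: 2
}
```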

Or am I wrong about why the process is getting killed, and is there a way for me to get more information about that from the kernel?
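For anyone else trying to answer that kernel-side question, these are the two places I know to look for OOM-killer records — a sketch, where journalctl assumes a systemd distro:

```shell
# Search the kernel ring buffer for OOM-killer activity
dmesg -T 2>/dev/null | grep -iE 'out of memory|oom-killer|killed process' || true

# On systemd machines the journal keeps kernel messages, including across reboots
journalctl -k --no-pager 2>/dev/null | grep -iE 'out of memory|killed process' || true
```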

This is the dmesg output that I get after the process is killed.

[711708.975215] Out of memory: Kill process 22087 (python) score 517 or sacrifice child
[711708.975221] Killed process 22087 (python) total-vm:9086536kB, anon-rss:6114136kB, file-rss:24kB, shmem-rss:0kB
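For scale, those kB figures work out to roughly 8.7 GiB of virtual memory and 5.8 GiB resident — enough to exhaust RAM on many desktops. Decoding them with awk (1 GiB = 1048576 kB):

```shell
# Convert the kB figures from the OOM line into GiB
line='total-vm:9086536kB, anon-rss:6114136kB'
echo "$line" | awk -F'[:k]' '{printf "total-vm %.1f GiB, anon-rss %.1f GiB\n", $2/1048576, $4/1048576}'
# prints: total-vm 8.7 GiB, anon-rss 5.8 GiB
```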


1 Answer


I met the same problem you did. The memory exhaustion is actually caused by the data_augmentation_options ssd_random_crop, so you can remove that option and set the batch size to 8 or smaller, e.g. 2 or 4. When I set the batch size to 1, I also ran into problems caused by NaN loss.

Another thing is that the parameter epsilon should be a very small number, such as 1e-6, according to the "Deep Learning" book. epsilon is used to avoid a zero denominator, so the value of 1 used here doesn't look correct to me.
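Putting those two suggestions together, the relevant parts of the config would look roughly like this — a sketch, keeping the rest of the pipeline file unchanged:

```
train_config: {
  batch_size: 8   # or 2 / 4 if memory is still tight
  optimizer {
    rms_prop_optimizer: {
      ...
      epsilon: 1e-6   # small value whose only job is to avoid a zero denominator
    }
  }
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  # ssd_random_crop removed - it was the source of the memory blow-up
}
```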


...