Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share

0 votes
4.1k views
in Technique by (71.8m points)

python - Reducing size of training dataset in tensorflow 2 tutorial (Transformer model for language understanding) with '.take(n)' method does not work

I am a TensorFlow newbie, so bear with me if my question is too basic ;)

I tried to reduce the size of the training dataset in the "Transformer model for language understanding" tutorial on the TensorFlow website (https://www.tensorflow.org/tutorials/text/transformer). My intention was to make my test runs faster while playing around with the code.

I thought I could use the dataset.take(n) method to shorten the training datasets, so I added two lines right after the original dataset is read from file:

...

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

# lines added to reduce dataset size
train_examples = train_examples.take(1000)
val_examples = val_examples.take(1000)

...

The resulting datasets (i.e., train_examples, val_examples) seem to have the intended size, and they seem to work, e.g., with the tokenizer, which comes next in the tutorial.
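A quick self-contained check (using a synthetic dataset, independent of the tutorial's data) suggests that .take(n) itself does what I expect:

```python
import tensorflow as tf

# Build a synthetic dataset and shorten it with .take(n).
full = tf.data.Dataset.range(50000)
small = full.take(1000)

# Count the elements actually yielded by the shortened dataset.
n_elements = sum(1 for _ in small)
print(n_elements)  # 1000
```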

However, I get tons of error messages and warnings when I execute the code, more specifically when execution enters training (i.e., train_step(inp, tar)). The error messages and warnings are too long to copy here, but perhaps the important part is this:

...

    /home/kst/python/tf/tf_env/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:1105 set_shape
        raise ValueError(

    ValueError: Tensor's shape (8216, 128) is not compatible with supplied shape (4870, 128)

WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_1

...

Some of the tensors in the training step seem to have an incompatible size or shape.

Is there a reason why .take(n) is not a suitable method for shortening datasets in TensorFlow?

Is there a better way to do it?

Thanks! :)



1 Answer

0 votes
by (71.8m points)

I found the problem! It was not the .take() method that caused the error messages; it was the checkpoint manager:

checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
  ckpt.restore(ckpt_manager.latest_checkpoint)
  print ('Latest checkpoint restored!!')

I had old checkpoints on my disk, which were created by an earlier run with a larger dataset. The checkpoint manager automatically restored the latest of those old checkpoints and tried to continue from there. Of course, the sizes and shapes of the old tensors did not match the new ones (e.g., the vocabulary sizes were different), which produced the error messages. Once I deleted the old checkpoints (i.e., the .ipynb_checkpoints directory) it all worked smoothly! :-)
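For anyone hitting the same issue, one simple safeguard is to wipe the checkpoint directory before a run whose dataset or vocabulary size has changed, so restore() can never pick up tensors of the wrong shape. A minimal sketch, assuming the tutorial's default checkpoint_path:

```python
import os
import shutil

# Path used by the tutorial's CheckpointManager.
checkpoint_path = "./checkpoints/train"

# Remove any checkpoints left over from a run with a different dataset size,
# so a later restore() cannot load tensors with mismatched shapes.
if os.path.isdir(checkpoint_path):
    shutil.rmtree(checkpoint_path)
os.makedirs(checkpoint_path)
```

Alternatively, point checkpoint_path at a different directory per experiment (e.g. "./checkpoints/train_small"), so checkpoints from incompatible runs never mix.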
