Tutorial on Hyper-parameter Tuning of a Vanilla Transformer Encoder for Classification of Non-textual Sequence Data Using TensorFlow and Keras Tuner
Table of Contents:
- Introduction.
- Code.
- Hyper-parameter explanation.
- Conclusion.
- References.
1. Introduction:
At the time of writing this article, OpenAI and the GPT architectures are all the rage for sequence modeling, especially in the Natural Language Processing domain. This article doesn’t focus on any of the latest Transformer models; rather, it focuses on the original Transformer architecture (which I refer to as “vanilla” in this article’s title). The first Transformer model was introduced in the seminal paper “Attention Is All You Need” [1]. This article assumes that the reader is already familiar with the Transformer architecture.
Here are a couple of articles that can be helpful if you are just getting started with Transformers:
- Attention is all you need: Discovering the Transformer paper
- Review — Attention Is All You Need (Transformer)
- Coding Attention Is All you Need using Tensorflow and Keras
- Implementing the Transformer Encoder from Scratch in TensorFlow and Keras
In my recent work on sequence modeling with healthcare datasets, the tokens are medical diagnoses rather than the natural-language tokens that BERT, GPT-2, GPT-3, etc. are trained on. Hence, I couldn’t use the pre-trained weights of those models for my work. In addition, my healthcare dataset wasn’t large enough to train the billions of parameters these models have. So I began by exploring LSTM and Bi-LSTM models. Once these models had reached the limits of their generalization, I next trained the dataset with a Vanilla Transformer model, whose performance was in fact not as good as the Bi-LSTM model’s. As a next step, I wanted to tune the hyper-parameters of the Vanilla Transformer to improve its performance. It was then that I realized there weren’t many articles detailing the steps for this task. After some time and experimentation, I was able to tune the Transformer to beat the performance of the LSTM-based models. I am now putting together this article on hyper-parameter tuning of the Vanilla Transformer for classification, for people who want to do it on their own datasets.
2. Code:
This article focuses on tuning the encoder part of the Transformer, as it is the only component used for a classification task. The publicly available IMDB dataset is used here for demonstration purposes. You can plug in your own dataset (text-based or non-text-based sequence data) and use the code and tips mentioned below.
Following is the code:
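The complete notebook is linked at the end of this article. As a minimal sketch of the pieces being tuned, assuming the standard Keras Transformer-block pattern, the snippet below prepares the IMDB sample data, defines a token-plus-position embedding, and defines a single vanilla encoder block. The helper names (TokenAndPositionEmbedding, transformer_encoder) and the exact layer arrangement are illustrative, not the exact notebook code.

```python
# Minimal, illustrative sketch (not the exact notebook code).
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000  # number of distinct tokens kept (tunable, see section 3)
MAX_LEN = 200       # maximum sequence length after padding/truncation (tunable)

# Load and pad the IMDB sample dataset; swap in your own tokenized sequences here.
(x_train, y_train), (x_val, y_val) = tf.keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=MAX_LEN)
x_val = tf.keras.preprocessing.sequence.pad_sequences(x_val, maxlen=MAX_LEN)


class TokenAndPositionEmbedding(layers.Layer):
    """Token embedding plus a learned positional embedding."""

    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)


def transformer_encoder(x, embed_dim, num_heads, ff_dim, dropout=0.1):
    """One vanilla encoder block: multi-head self-attention and a position-wise
    feed-forward network, each with dropout, a residual connection, and layer norm."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim // num_heads)(x, x)
    attn = layers.Dropout(dropout)(attn)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)

    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(embed_dim)(ff)
    ff = layers.Dropout(dropout)(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + ff)
```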
Sample output from the code above when run on Google Colab:
3. Hyper-parameter explanation:
Following are the hyper-parameters that I tweak in the build_model(hp) function in the code above. I found that tweaking these gave some of the best results in my case (a vanilla Transformer on a sequential healthcare dataset where diagnoses and healthcare providers were the input sequence tokens). A sketch of how these map onto Keras Tuner search-space calls follows the list.
1. emb_vector_size: Embedding vector dimensions, i.e., the dimensionality of the token-plus-position embedding vectors fed into the encoder.
2. num_heads: Number of attention heads in the Transformer encoder. Increasing the number of heads can lead to better-represented vectors for classification tasks.
3. ff_dim: Hidden-layer size of the feed-forward network inside each Transformer encoder block. These Dense layers add learnable parameters to every encoder block.
4. num_transformer_blocks: Total number of encoder blocks. Increasing the number of these blocks may also lead to better-represented vectors for classification tasks, but it will increase training time.
5. num_mlp_layers: Number of MLP (Dense) layers after the encoder.
6. mlp_units: Number of units in the Dense networks after the encoder layer.
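Below is an illustrative sketch of a build_model(hp) that exposes each of the hyper-parameters above to Keras Tuner. The search ranges are example values rather than the exact ones I used, and the sketch reuses the TokenAndPositionEmbedding and transformer_encoder helpers from section 2.

```python
# Illustrative build_model(hp); ranges are examples only. Reuses MAX_LEN, VOCAB_SIZE,
# TokenAndPositionEmbedding, and transformer_encoder from the sketch in section 2.
import tensorflow as tf
from tensorflow.keras import layers


def build_model(hp):
    emb_vector_size = hp.Choice("emb_vector_size", [32, 64, 128])    # item 1 above
    num_heads = hp.Choice("num_heads", [2, 4, 8])                    # item 2
    ff_dim = hp.Choice("ff_dim", [32, 64, 128])                      # item 3
    num_transformer_blocks = hp.Int("num_transformer_blocks", 1, 4)  # item 4
    num_mlp_layers = hp.Int("num_mlp_layers", 1, 3)                  # item 5
    mlp_units = hp.Choice("mlp_units", [64, 128, 256])               # item 6
    dropout = hp.Float("dropout", 0.1, 0.5, step=0.1)
    mlp_dropout = hp.Float("mlp_dropout", 0.1, 0.5, step=0.1)

    inputs = layers.Input(shape=(MAX_LEN,))
    x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, emb_vector_size)(inputs)

    # Stack of vanilla encoder blocks.
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, emb_vector_size, num_heads, ff_dim, dropout)

    # Classification head: pool the encoder output, then a small MLP.
    x = layers.GlobalAveragePooling1D()(x)
    for _ in range(num_mlp_layers):
        x = layers.Dense(mlp_units, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # binary label for IMDB

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Keras Tuner calls build_model(hp) once per trial, sampling a fresh combination of values from these ranges each time.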
Some additional hyper-parameters that can be tweaked are:
- The vocabulary size (VOCAB_SIZE in the code above).
- The max sequence length (MAX_LEN in the code above).
- Increasing the dropout probability if you see the Transformer overfitting the training set within a few epochs (mlp_dropout and dropout in the code above).
These additional hyper-parameters can be increased or decreased based on your particular dataset/use case. For me, reducing VOCAB_SIZE and MAX_LEN helped, while increasing the dropout probability combated the overfitting.
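To tie the pieces together, here is a rough sketch of wiring build_model(hp) into a Keras Tuner search. The choice of RandomSearch, the trial budget, and the epoch count are placeholders to adjust for your own dataset; Hyperband or Bayesian optimization can be swapped in the same way.

```python
# Rough sketch of running the search; algorithm, trials, and epochs are placeholders.
import keras_tuner as kt

tuner = kt.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=10,
    overwrite=True,
    directory="tuner_logs",
    project_name="vanilla_transformer_encoder",
)

# x_train / y_train come from the data-preparation sketch in section 2.
tuner.search(x_train, y_train, validation_split=0.2, epochs=5)

best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print(best_hps.values)                          # the winning hyper-parameter values
best_model = tuner.hypermodel.build(best_hps)   # rebuild the model with those values
```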
4. Conclusion:
This article walked through the steps for tuning the hyper-parameters of a Vanilla Transformer encoder for classification tasks. As mentioned earlier, BERT, GPT-2, GPT-3, etc. cannot be readily used if the sequence data doesn’t consist of natural-language tokens, and training these models from scratch on a small dataset won’t be fruitful given the billions of parameters they have. Instead, starting with simple RNN and Bi-LSTM models can help establish a baseline, after which one can experiment with a Vanilla Transformer to beat that baseline and see if it achieves the required performance improvement.
The code for this tutorial can be found at:
Thank you for reading; that’s all for this article. More content to follow. Please clap if the article was helpful to you and comment if you have any questions. If you want to connect with me, learn and grow with me, or collaborate, you can reach me at any of the following:
Linkedin:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir
5. References:
[1] Attention is all you need paper
[2] Attention is all you need: Discovering the Transformer paper
[3] Review — Attention Is All You Need (Transformer)
[4] Coding Attention Is All you Need using Tensorflow and Keras
[5] Implementing the Transformer Encoder from Scratch in TensorFlow and Keras
[6] Time series classification with a Transformer model using Tensorflow Keras
[7] Introduction to the Keras Tuner