Going the extra mile, lessons learnt from Kaggle on how to train better NLP models (Part I)




In this series we’ll explore some techniques that can squeeze a few extra points of performance from an NLP model.

We’ll demonstrate these techniques and the results of applying them in a Kaggle competition. While this was an NLP competition, some of these techniques can also be applied in other settings.

About the competition

The competition is the CommonLit Readability Prize.

The objective is to rate the complexity of reading passages for classroom use by students in grades 3–12 (ages 8–18). The dataset is composed of texts of about 256 tokens with labels of the reading complexity score.

Reading complexity refers to the reading level a child needs in order to understand a passage of text. It depends on the grammar, syntax and semantics of the text, as well as on the vocabulary used. The score for each text is the average of the scores given by 5–10 people who read and rated the passage.

As you can imagine, because of the nature of the task, the semantics as well as the syntax and text cohesion are very important. In this context, text cohesion means the way in which words link together to form a logical meaning from beginning to end (in our case, tighter cohesion means simpler sentences and a lower reading complexity).

The training set is quite small (just 2834 examples), so overfitting is a real danger. We could try augmenting the data to increase the number of training examples, but more on that in a future blog post.

A word on evaluation

The scoring function used in the competition is the root mean squared error (RMSE), so smaller is better. Note that Kaggle has a public test set and a private test set. For this competition the public one contains 30% of the test data, while the private one has the remaining 70%. To give you a bit of context, the top scores were in the 0.44–0.45 RMSE range.
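As a quick reference, here is a minimal sketch of the metric in plain Python. The function name and the sample numbers are made up for illustration and are not taken from the competition data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: errors of 0.5, 0.0 and 0.5.
print(rmse([0.0, -1.0, -2.0], [0.5, -1.0, -2.5]))  # about 0.408
```

Because the errors are squared before averaging, a few badly mispredicted passages hurt the score more than many small misses.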

Let’s talk solutions

Manual feature extraction

A simple approach to this problem is to extract readability features as proposed in [1][2] and feed them to a logistic regression or a shallow neural net with 2 layers. Many competitors tried this approach before attempting deep learning; it will give you an RMSE of around 0.7. You can see it here.
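To give a feel for what "readability features" can look like, here is a rough sketch that computes three simple surface statistics. These particular features and the function name are assumptions chosen for illustration; they are not necessarily the ones proposed in [1][2] or used in the linked notebook.

```python
import re


def readability_features(text):
    """Compute a few simple surface-level readability statistics."""
    # Naive sentence split on ., ! and ? (good enough for a sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_sentences = max(len(sentences), 1)
    n_words = max(len(words), 1)
    return {
        # Longer sentences tend to be harder to read.
        "avg_sentence_length": len(words) / n_sentences,
        # Longer words are a rough proxy for harder vocabulary.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        # Share of distinct words, a rough measure of lexical variety.
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
    }


features = readability_features("The cat sat. The dog ran fast.")
print(features)
```

A feature vector like this, computed for every passage, is what would then be fed to the regression model.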


A basic BERT model achieves an RMSE of about 0.62 on its own, with the default configuration from the transformers package. Let’s see how.

Transformers are currently the state of the art in NLP. You can easily create a good baseline for this problem by adding a regression layer on top of transformers and fine-tuning it.

Depending on the chosen pre-trained model, doing that will get you an RMSE of between 0.62 and 0.70.

We compared a few different models by training each of them for one epoch with the default parameters. The results are included in the table below, but take them with a grain of salt: most of these models would probably do better with a bit of hyperparameter tuning.

We see a surprising result in ALBERT-v1, which is a much smaller model than the others (remember that we are evaluating with RMSE, so a smaller number is better). In the competition’s forum, many suggested that a smaller model performs better at first for this task, partly because of the fixed semantic nature (a smaller model doesn’t learn as many nuanced features but more generic ones), and partly because larger models might overfit even in a single epoch.

The general consensus though, after many people tested different versions and hyperparameter sets of the models, was that RoBERTa and RoBERTa-large performed best (as was expected, since RoBERTa is currently near the top of most NLP benchmarks).

Now, what if I told you that at the end of this series we’ll get to an RMSE of about 0.46 using a single model without ensembling?

That’s the difference between using a basic pre-trained model and taking the time to try a few different techniques that get you that extra mile of accuracy.

Basic version

We’ll first go over the complete code of this basic notebook that uses a pre-trained roberta-base model. In doing so we’ll set the scene for all the smaller changes that we’ll make later in the blog.

You have the complete code in the notebook above. Here we’ll just go over the more important parts.

Let’s first look at what the final training code looks like.

We load the data, model, optimizer and scheduler and then use all of them in a train function. Note that we also do garbage collection by using gc and deleting the model after we’re done with it. This is important because the usual free ML cloud machines (Kaggle, Google Colab) have 16 GB of RAM, and you will run out of memory otherwise.
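The cleanup pattern described above might look roughly like the sketch below. The helper name run_and_release and the tiny stand-in model are hypothetical; the notebook does this inline around its own model and train function.

```python
import gc

import torch
from torch import nn


def run_and_release(build_model, train_fn):
    """Build a model, train it, then free its memory before returning."""
    model = build_model()
    result = train_fn(model)
    # Drop the last reference so Python can reclaim the weights.
    del model
    gc.collect()
    # Also release cached GPU memory when a GPU is in use.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return result


# Tiny stand-ins for the real model and training function.
n_params = run_and_release(
    build_model=lambda: nn.Linear(768, 1),
    train_fn=lambda m: sum(p.numel() for p in m.parameters()),
)
print(n_params)
```

The important part is the order: save whatever you need from the model first, then delete it and collect, so the next model starts with as much free memory as possible.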

Note that MODEL_PATH and TOKENIZER_PATH need to point to the input folder as the rules of the competition specify your notebook doesn’t have internet access. We use that to load the RoBERTa tokenizer.

We use a custom defined Dataset (LitDataset) which will be used by a DataLoader which will yield the input in batches to the model. This is pretty standard as they handle memory, batching and other optimizations.
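A minimal sketch of such a dataset is shown below. It assumes the tokenizer is passed in and can be called the way Hugging Face tokenizers are (with padding, truncation and max_length), and that targets are optional so the same class can serve test data. The constructor signature, the make_loader helper and the batch size of 16 are assumptions, not the notebook's exact code.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class LitDataset(Dataset):
    """Tokenizes all texts up front and serves them as tensors."""

    def __init__(self, texts, tokenizer, targets=None, max_len=256):
        # Pad or truncate every text to the same length so batches stack.
        self.encoded = tokenizer(
            list(texts),
            padding="max_length",
            truncation=True,
            max_length=max_len,
        )
        # Targets are absent at inference time.
        self.targets = (
            None if targets is None
            else torch.tensor(list(targets), dtype=torch.float32)
        )

    def __len__(self):
        return len(self.encoded["input_ids"])

    def __getitem__(self, index):
        input_ids = torch.tensor(self.encoded["input_ids"][index])
        attention_mask = torch.tensor(self.encoded["attention_mask"][index])
        if self.targets is None:
            return input_ids, attention_mask
        return input_ids, attention_mask, self.targets[index]


def make_loader(dataset, batch_size=16, shuffle=False):
    """Wrap the dataset in a DataLoader (batch size is an assumed default)."""
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
```

For training you would typically pass shuffle=True, and keep shuffle=False for prediction so the outputs stay aligned with the input rows.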

Let’s look closer at the LitModel class that defines our custom model wrapper.

In __init__() we load the configuration and the pretrained model and prepare a simple regression layer that will take the pre-trained model’s hidden state and map it to a single output.

In forward() we apply that regressor on top of RoBERTa’s last layer. In RoBERTa’s case there are thirteen layers with hidden states, one for the starting embedding layer and twelve for the RoBERTa layers.
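The wrapper described above could be sketched as follows. Two things here are assumptions rather than the notebook's confirmed design: the backbone is passed in (with a hypothetical from_path helper for loading from MODEL_PATH), and the regressor reads the first token's vector from the last layer, which is only one of several possible pooling choices.

```python
import torch
from torch import nn
from transformers import AutoConfig, AutoModel


class LitModel(nn.Module):
    """Pre-trained encoder with a single-output regression layer on top."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.regressor = nn.Linear(backbone.config.hidden_size, 1)

    @classmethod
    def from_path(cls, model_path):
        # Hypothetical helper: load config and weights from a local folder.
        config = AutoConfig.from_pretrained(model_path)
        return cls(AutoModel.from_pretrained(model_path, config=config))

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        # Tuple of (embedding output, layer 1, ..., layer N) hidden states.
        last_layer = outputs.hidden_states[-1]
        # Assumption: use the first token as the passage representation.
        first_token = last_layer[:, 0, :]
        return self.regressor(first_token)
```

With roberta-base, outputs.hidden_states has the thirteen entries mentioned above, and indexing with -1 picks the output of the final RoBERTa layer.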

Let’s also look at the create_optimizer() function as that will receive more custom changes down the road.

This function applies the AdamW optimizer. For now we simply use all of the model’s parameters and give them the same learning rate (2e-5 seems to be the default most people went with in the beginning).

We could have just done AdamW(model.parameters(), lr=2e-5) but we’re preparing this for when we’ll apply different learning rates to different layers.
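A minimal sketch of that idea, assuming PyTorch's own torch.optim.AdamW (the notebook may import AdamW from elsewhere): the parameters go into an explicit list of groups, so that more groups with their own learning rates can be added later.

```python
from torch import nn
from torch.optim import AdamW


def create_optimizer(model, lr=2e-5):
    """AdamW over all parameters, expressed as explicit parameter groups."""
    # One group for now; each group can carry its own "lr" later on.
    parameter_groups = [
        {"params": list(model.parameters()), "lr": lr},
    ]
    return AdamW(parameter_groups)


optimizer = create_optimizer(nn.Linear(768, 1))
print(len(optimizer.param_groups), optimizer.param_groups[0]["lr"])
```

Functionally this is the same as AdamW(model.parameters(), lr=2e-5); the only difference is that the group structure is already in place.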

Finally, the training function. We implement a typical PyTorch training loop that iterates over epochs and batches and updates the model, printing metrics every ten batches.
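In outline, such a loop might look like the sketch below. It assumes the loader yields (input_ids, attention_mask, target) batches and that the model returns one value per passage; the signature, the MSE loss and the returned loss history are choices made for this sketch, not a copy of the notebook's function.

```python
import torch
from torch import nn


def train(model, loader, optimizer, scheduler=None,
          num_epochs=1, device="cpu", log_every=10):
    """Plain training loop; returns the per-batch MSE losses."""
    model.to(device)
    loss_fn = nn.MSELoss()
    history = []
    for epoch in range(num_epochs):
        model.train()
        for batch_num, (input_ids, attention_mask, target) in enumerate(loader):
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            pred = model(input_ids, attention_mask).flatten()
            loss = loss_fn(pred, target)
            loss.backward()
            optimizer.step()
            if scheduler is not None:
                scheduler.step()  # assumes a per-batch schedule

            history.append(loss.item())
            if batch_num % log_every == 0:
                # RMSE is the square root of the MSE we optimize.
                print(f"epoch {epoch} batch {batch_num} "
                      f"RMSE {loss.item() ** 0.5:.4f}")
    return history
```

Optimizing MSE and reporting its square root keeps the printed numbers on the same scale as the competition metric.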

For prediction we’ll simply load the model again, prepare the test data and call the predict() function to get our predictions.
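A predict() function along these lines could be sketched as follows, assuming the test loader yields (input_ids, attention_mask) pairs without targets. The signature is an assumption made for this sketch.

```python
import torch


def predict(model, loader, device="cpu"):
    """Run the model over a loader and return predictions as a flat list."""
    model.to(device)
    model.eval()  # disable dropout so predictions are deterministic
    predictions = []
    with torch.no_grad():  # no gradients needed at inference time
        for input_ids, attention_mask in loader:
            pred = model(input_ids.to(device), attention_mask.to(device))
            predictions.extend(pred.flatten().cpu().tolist())
    return predictions
```

The returned list has one float per test passage, in loader order, which is the shape needed for a submission column.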

This gets an RMSE of 0.624 on the public test set and 0.622 on the private one. That is already better than the ‘features only’ model we tried in the beginning.

In the next blog in this series, the fun will begin as we explore some techniques for further improving the score step by step.

Going the extra mile, lessons learnt from Kaggle on how to train better NLP models (Part II)

