
Categories: Torch, Hugging Face, Transformers, NLP

๐„๐ฆ๐›๐ž๐๐๐ข๐ง๐ ๐ฌ ๐Œ๐ž๐ซ๐ ๐ž 101: ๐€ ๐’๐ญ๐ž๐ฉ-๐›๐ฒ-๐ฌ๐ญ๐ž๐ฉ ๐ ๐ฎ๐ข๐๐ž ๐จ๐ง ๐ฆ๐ž๐ซ๐ ๐ข๐ง๐  ๐ž๐ฆ๐›๐ž๐๐๐ข๐ง๐ ๐ฌ (same or different family/architecture)


Have you identified several fill-mask models (a.k.a. masked language models, a.k.a. embedding models) that complement each other, and you want to use all of them to train, for example, a classifier on top? Easy-peasy.

The conditions you need to meet are:

  • For the simplified approach, the embeddings must come from the same architecture (say, both roberta-base or both bert-large-cased) and have the same dimensions.
  • If you are invested in merging models from different families (e.g., textual and image embeddings, or BERT and RoBERTa), take a look at the complex approach below.

Simplified approach: Merging models of the same architecture

Combining two fill-mask models into one in Hugging Face generally involves a few steps:

  1. Load the Models: Load the two models you want to combine.
  2. Combine the Embeddings: Merge the embeddings from both models.
  3. Create a New Model: Integrate the combined embeddings into a new model.
  4. Save and Load the New Model: Persist the result so it can be reused.

Step 1: Load the Models

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name_1 = 'model_name_1'
model_name_2 = 'model_name_2'

# Load the models
model1 = AutoModelForMaskedLM.from_pretrained(model_name_1)
model2 = AutoModelForMaskedLM.from_pretrained(model_name_2)

# Load the tokenizers
tokenizer1 = AutoTokenizer.from_pretrained(model_name_1)
tokenizer2 = AutoTokenizer.from_pretrained(model_name_2)

Step 2: Combine the Embeddings

You need to ensure that both models use compatible tokenizers, or you will have to handle token mappings yourself (see the complex approach below); a quick sanity check is sketched right after this paragraph. Once that is settled, there are several techniques to combine the embeddings:
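As a quick sanity check, you can compare the two vocabularies directly. This is a minimal sketch, assuming both tokenizers expose get_vocab(); identical vocabularies mean the embedding rows line up one-to-one:

# Sketch: confirm both tokenizers share the same vocabulary before merging
vocab1 = tokenizer1.get_vocab()
vocab2 = tokenizer2.get_vocab()
assert vocab1 == vocab2, "Vocabularies differ - consider the complex approach below."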

Averaging Embeddings

The simplest technique is to sum the embeddings and divide by 2.

import torch

# Get embeddings from both models
embeddings1 = model1.base_model.embeddings.word_embeddings.weight
embeddings2 = model2.base_model.embeddings.word_embeddings.weight

# Ensure the dimensions match
assert embeddings1.shape == embeddings2.shape, "Embedding dimensions do not match!"

# Combine embeddings (e.g., by averaging)
combined_embeddings = (embeddings1 + embeddings2) / 2

Concatenating the embeddings

Concatenating the embeddings from two models increases the dimensionality of the resulting embeddings, which may capture more information from both models.

import torch

# Concatenate embeddings
combined_embeddings = torch.cat((embeddings1, embeddings2), dim=-1)

# Adjust the embedding layer of the new model to match the new dimension
new_model.config.hidden_size = combined_embeddings.shape[-1]
new_model.base_model.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(combined_embeddings)
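Note that simply overwriting config.hidden_size on an already-built model does not resize its other layers. A cleaner route, sketched below under the assumption that you want a BERT-style architecture for the merged model, is to build the new model from a config whose hidden size already matches the concatenated dimension:

from transformers import BertConfig, BertForMaskedLM

# Sketch: create the target model with the wider hidden size up front
config = BertConfig.from_pretrained(model_name_1)
config.hidden_size = combined_embeddings.shape[-1]  # must stay divisible by config.num_attention_heads
new_model = BertForMaskedLM(config)
# Detach so the merged table becomes a fresh leaf parameter of the new model
new_model.base_model.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(combined_embeddings.detach())

All layers other than the word embeddings will be randomly initialized at the new width, so fine-tuning (see the last note) is essential.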

Weighted Sum

Use a weighted sum to combine the embeddings. You can learn the weights during training or set them manually.

alpha = 0.7 # Weight for the first model
beta = 0.3 # Weight for the second model

# Combine embeddings with weights
combined_embeddings = alpha * embeddings1 + beta * embeddings2

# Set the combined embeddings to the new model
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings)
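If you would rather learn the weights during training, a minimal sketch could look like the following (the module name LearnedWeightedSum is hypothetical; train it jointly with your downstream task):

import torch
import torch.nn as nn

class LearnedWeightedSum(nn.Module):
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))  # one logit per source model

    def forward(self, emb1, emb2):
        w = torch.softmax(self.logits, dim=0)  # weights sum to 1
        return w[0] * emb1 + w[1] * emb2

combiner = LearnedWeightedSum()
combined_embeddings = combiner(embeddings1, embeddings2)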

Linear Transformations

Use a linear transformation to combine the embeddings. This approach allows learning a transformation matrix during training.

import torch.nn as nn

class LinearCombiner(nn.Module):
    def __init__(self, embedding_dim):
        super(LinearCombiner, self).__init__()
        self.transform = nn.Linear(embedding_dim * 2, embedding_dim)

    def forward(self, emb1, emb2):
        combined = torch.cat((emb1, emb2), dim=-1)
        return self.transform(combined)

combiner = LinearCombiner(embeddings1.shape[-1])
combined_embeddings = combiner(embeddings1, embeddings2)

# Set the combined embeddings to the new model
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings)

Use a Neural Network instead to combine them

Use a small neural network to learn how to combine the embeddings.

class EmbeddingCombiner(nn.Module):
    def __init__(self, embedding_dim):
        super(EmbeddingCombiner, self).__init__()
        self.fc1 = nn.Linear(embedding_dim * 2, embedding_dim)
        self.fc2 = nn.Linear(embedding_dim, embedding_dim)
        self.relu = nn.ReLU()

    def forward(self, emb1, emb2):
        combined = torch.cat((emb1, emb2), dim=-1)
        combined = self.relu(self.fc1(combined))
        return self.fc2(combined)

combiner = EmbeddingCombiner(embeddings1.shape[-1])
combined_embeddings = combiner(embeddings1, embeddings2)

# Set the combined embeddings to the new model
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings)

Neural network + Attention Mechanism

Use an attention mechanism to learn how to combine embeddings. This method allows the model to weigh the importance of each embedding dynamically.

class AttentionCombiner(nn.Module):
    def __init__(self, embedding_dim):
        super(AttentionCombiner, self).__init__()
        self.attention = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=1)

    def forward(self, emb1, emb2):
        # Stack the two embedding tables as a length-2 sequence: (2, vocab_size, embedding_dim)
        combined = torch.stack((emb1, emb2), dim=0)
        attention_output, _ = self.attention(combined, combined, combined)
        return torch.mean(attention_output, dim=0)

combiner = AttentionCombiner(embeddings1.shape[-1])
combined_embeddings = combiner(embeddings1, embeddings2)

# Set the combined embeddings to the new model
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings)

Step 3: Create a New Model

Create a new model with the combined embeddings. We can do this by initializing a new model and replacing its embeddings with the combined ones.

from transformers import BertConfig, BertForMaskedLM

# Use a configuration from one of the models or create a new one
config = BertConfig.from_pretrained(model_name_1)

# Create a new model
new_model = BertForMaskedLM(config)

# Replace the embeddings with the combined embeddings
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings)

# Save the new model
new_model.save_pretrained('combined_model')

Step 4: Save and Load the New Model

Save the model so you can load it later as needed.

# Save the new model and tokenizer
new_model.save_pretrained('path_to_combined_model')
tokenizer1.save_pretrained('path_to_combined_model')

Loading and Using the New Model

Now you can load and use the new model as usual.

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the new model and tokenizer
new_model = AutoModelForMaskedLM.from_pretrained('path_to_combined_model')
tokenizer = AutoTokenizer.from_pretrained('path_to_combined_model')

# Example usage
input_text = "This is a [MASK] example."
inputs = tokenizer(input_text, return_tensors='pt')
outputs = new_model(**inputs)
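To sanity-check the merged model, you can look at its top predictions for the masked position. Here is a minimal sketch, assuming the tokenizer defines a mask token (as BERT-style tokenizers do):

# Find the masked position and print the model's top 5 guesses
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = outputs.logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))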

Complex approach: What if the embeddings come from different families?

Combining embeddings from different model families, such as textual and image embeddings, or a model like roberta-base with bert-base-uncased, can be more challenging than combining embeddings from the same family. Different architectures may have different embedding dimensions, tokenization strategies, and even pre-training objectives. However, it's not impossible.

The steps now additionally include:

  1. Tokenization Alignment: Different models often have different tokenizers. To combine embeddings, you need to align the tokenization strategies. One approach is to use a unified tokenizer that works with both models, but this can be complex.
  2. Embedding Dimension Alignment: If the embedding dimensions of the two models differ, you'll need to align them. This can be done with techniques like a linear transformation, zero-padding, or a projection to a common space.

Here's a more detailed example that combines embeddings from roberta-base and bert-base-uncased:

Step 1: Load the Models and Tokenizers

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the models
model_name_1 = 'roberta-base'
model_name_2 = 'bert-base-uncased'

model1 = AutoModelForMaskedLM.from_pretrained(model_name_1)
model2 = AutoModelForMaskedLM.from_pretrained(model_name_2)

# Load the tokenizers
tokenizer1 = AutoTokenizer.from_pretrained(model_name_1)
tokenizer2 = AutoTokenizer.from_pretrained(model_name_2)

Step 2: Tokenization

You need to ensure the tokens from both tokenizers align. One way is to tokenize the input with both tokenizers and handle the alignment manually.

input_text = "This is a [MASK] example."
# Note: [MASK] is BERT's mask token; roberta-base uses <mask> instead.

tokens1 = tokenizer1.tokenize(input_text)
tokens2 = tokenizer2.tokenize(input_text)

# Convert tokens to IDs
ids1 = tokenizer1.convert_tokens_to_ids(tokens1)
ids2 = tokenizer2.convert_tokens_to_ids(tokens2)

# Ensure alignment, e.g., by padding or truncating
max_length = max(len(ids1), len(ids2))
ids1 = ids1 + [tokenizer1.pad_token_id] * (max_length - len(ids1))
ids2 = ids2 + [tokenizer2.pad_token_id] * (max_length - len(ids2))

Step 3: Get Embeddings

Retrieve the embeddings from both models.

import torch

# Get embeddings

embeddings1 = model1.roberta.embeddings.word_embeddings.weight
embeddings2 = model2.bert.embeddings.word_embeddings.weight

Step 4: Align Embedding Dimensions

If the embedding dimensions differ, use a linear layer to project them to a common dimension.

import torch.nn as nn

# Assuming embeddings1 and embeddings2 have different dimensions
dim1 = embeddings1.size(1)
dim2 = embeddings2.size(1)
common_dim = max(dim1, dim2)

# Linear layers to project to a common dimension
linear1 = nn.Linear(dim1, common_dim)
linear2 = nn.Linear(dim2, common_dim)

projected_embeddings1 = linear1(embeddings1)
projected_embeddings2 = linear2(embeddings2)

Step 5: Combine the Embeddings

Combine the projected embeddings using a chosen technique (e.g., averaging, concatenation, weighted sum). Note that element-wise combinations like the average below only broadcast if both models also share the same vocabulary size; roberta-base and bert-base-uncased do not, so you additionally need to align the vocabularies (a rough sketch follows the snippet).

# Combine embeddings, for example, by averaging
combined_embeddings = (projected_embeddings1 + projected_embeddings2) / 2

# Create a new embedding layer
new_embedding_layer = nn.Embedding.from_pretrained(combined_embeddings)
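Because roberta-base (roughly 50k BPE tokens) and bert-base-uncased (roughly 30k WordPiece tokens) have different vocabulary sizes, the element-wise average above will not work as-is. A rough way to handle this, sketched below, is to keep BERT's vocabulary and average only the tokens whose string form also exists in RoBERTa's vocabulary (a crude heuristic, since the two tokenizers use different sub-word conventions), falling back to BERT's projected embedding otherwise:

import torch

roberta_vocab = tokenizer1.get_vocab()  # token string -> id

aligned_rows = []
for token, bert_id in sorted(tokenizer2.get_vocab().items(), key=lambda kv: kv[1]):
    roberta_id = roberta_vocab.get(token)
    if roberta_id is not None:
        # Token exists in both vocabularies: average the two projected rows
        row = (projected_embeddings1[roberta_id] + projected_embeddings2[bert_id]) / 2
    else:
        # Token only exists in BERT's vocabulary: keep BERT's projected row
        row = projected_embeddings2[bert_id]
    aligned_rows.append(row)

# One row per BERT token, detached so it becomes a fresh embedding table
combined_embeddings = torch.stack(aligned_rows).detach()
new_embedding_layer = nn.Embedding.from_pretrained(combined_embeddings)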

Step 6: Integrate into a New Model

Integrate the combined embeddings into a new model architecture.

from transformers import BertConfig, BertForMaskedLM

# Create a new configuration
config = BertConfig.from_pretrained(model_name_2)

# Initialize a new model
new_model = BertForMaskedLM(config)

# Replace the embeddings with the combined embeddings
new_model.bert.embeddings.word_embeddings = new_embedding_layer

# Save the new model
new_model.save_pretrained('path_to_combined_model')
tokenizer2.save_pretrained('path_to_combined_model')

Last Note

Regardless of the approach you followed, it's crucial to fine-tune the resulting model on a relevant dataset, to make sure it's fully aligned with your new data.
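If you want a starting point, here is a minimal masked-language-modelling fine-tuning sketch using the Trainer API, with the tokenizer you saved alongside the merged model; the texts list is a placeholder for your own corpus and the training arguments are only illustrative:

import torch
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

texts = ["replace this with sentences from your own domain", "..."]
encodings = tokenizer(texts, truncation=True, padding=True)

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

# Randomly mask 15% of tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
training_args = TrainingArguments(output_dir="combined_model_finetuned", num_train_epochs=1, per_device_train_batch_size=8)

trainer = Trainer(
    model=new_model,
    args=training_args,
    train_dataset=TextDataset(encodings),
    data_collator=data_collator,
)
trainer.train()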

Enjoy and don't forget to drop a like ❤️!

Need help?

At Mantis, our experienced team of NLP engineers is ready to help. If you have any NLP-related questions, reach out to us at hi@mantisnlp.com.


๐„๐ฆ๐›๐ž๐๐๐ข๐ง๐ ๐ฌ ๐Œ๐ž๐ซ๐ ๐ž 101: ๐€ ๐’๐ญ๐ž๐ฉ-๐›๐ฒ-๐ฌ๐ญ๐ž๐ฉ ๐ ๐ฎ๐ข๐๐ž ๐จ๐ง ๐ฆ๐ž๐ซ๐ ๐ข๐ง๐ โ€ฆ was originally published in MantisNLP on Medium, where people are continuing the conversation by highlighting and responding to this story.

Next Article

How we are thinking about generative AI: costs and abilities

Weโ€™ve written a couple of previous blogs giving our perspectives on generative โ€ฆ

Read Post

Do you have a Natural Language Processing problem you need help with?

Let's Talk