MLM, or masked language modeling, is an important task for training a BERT model. In the original BERT paper, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," it is one of the two main objectives used to pre-train BERT. So if you have your own corpus, you can train MLM on top of any pre-trained BERT-style model, e.g., RoBERTa or SciBERT.
Hugging Face Library and Input TSV
The Hugging Face library supports a wide range of pre-trained BERT models. Let's first prepare a TSV file as our corpus; this will be the input file for training the MLM. Simply put the free text one passage per line, and say we name this file MyData.tsv.
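For illustration only, here is a tiny, made-up sketch of how such a file could be produced; in practice you would dump your own corpus, one passage per line:

# A made-up sketch of what MyData.tsv can look like: one free-text passage per line.
lines = [
    'Masked language modeling hides a fraction of the tokens and asks the model to recover them.',
    'Domain-specific corpora often help downstream tasks in that domain.',
    'Each line of the file is treated as one training example.',
]
with open('MyData.tsv', 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines) + '\n')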
MLM for regular BERT Models
For the common BERT models, the Hugging Face library provides dedicated, well-defined classes. Let's take the RoBERTa model as an example. There are three classes we need to be familiar with:
from transformers import RobertaTokenizerFast
from transformers import RobertaConfig
from transformers import RobertaForMaskedLM

# Reuse the pre-trained roberta-base tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

config = RobertaConfig(
    vocab_size=52_000,  # note: the roberta-base tokenizer has 50,265 tokens; vocab_size=tokenizer.vocab_size would match it exactly
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# Building the model from this config initializes it with random weights.
model = RobertaForMaskedLM(config=config)
The tokenizer is associated with the RoBERTa model; you will need to switch to the corresponding classes if you are not using RoBERTa. Similarly, here we want to initialize a RoBERTa model, so we pass the config to init the model. Note that initializing from a config gives you a model with random weights; if you want to continue pre-training from the released checkpoint instead, load it with RobertaForMaskedLM.from_pretrained("roberta-base"). Now, to deal with the data loading:
from transformers import DataCollatorForLanguageModeling
from transformers import LineByLineTextDataset

# The collator randomly masks tokens on the fly for the MLM objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Read the corpus one line at a time and tokenize it.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='MyData.tsv',
    block_size=128,
)
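As an optional sanity check (a minimal sketch, not part of the training script, and assuming MyData.tsv has at least two lines), you can collate a couple of examples and see how many positions were picked for masking:

# Collate two examples from the dataset and inspect the result.
batch = data_collator([dataset[i] for i in range(2)])
print(batch['input_ids'].shape)                                      # (2, padded sequence length)
print((batch['labels'] != -100).sum().item())                        # positions selected for the MLM loss
print((batch['input_ids'] == tokenizer.mask_token_id).sum().item())  # how many of them were replaced by <mask>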
Simply, we provide the file path to LineByLineTextDataset so that the corpus is loaded line by line. You might also have noticed that in the data_collator we set the masking probability to 0.15, the same value used in the BERT paper. Next, start training and save the trained model:
from transformers import Trainer, TrainingArguments

# Output directory for checkpoints and the final model (placeholder path; change as needed).
trained_path = './trained_model'

training_args = TrainingArguments(
    output_dir=trained_path,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

print('Start training...')
trainer.train()

# Save the final model, plus the tokenizer so both can be reloaded later.
trainer.save_model(trained_path)
tokenizer.save_pretrained(trained_path)
print('Finished training, saved to', trained_path)
After training finishes, you will find the saved model under trained_path. Next time, you can load the model for your own downstream tasks.
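For example, here is a minimal sketch of reloading what was saved above and poking at it with the fill-mask pipeline (the example sentence is just an illustration):

from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Reload the weights and tokenizer saved under trained_path.
model = RobertaForMaskedLM.from_pretrained(trained_path)
tokenizer = RobertaTokenizerFast.from_pretrained(trained_path)

# RoBERTa's mask token is <mask>; the pipeline returns the top predictions for it.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(fill_mask('Masked language modeling predicts the <mask> tokens.'))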
MLM for special BERT Models
In the library, there are many other BERT models, e.g., SciBERT. Such models don't come with their own dedicated Tokenizer or Config classes, but it is still possible to train MLM on top of them. Next, let's take the pre-trained SciBERT as an example; the following is the way to initialize it:
from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM

# 'scibert_scivocab_uncased' can be a local directory containing the SciBERT files,
# or the Hugging Face Hub identifier 'allenai/scibert_scivocab_uncased'.
config = AutoConfig.from_pretrained('scibert_scivocab_uncased')
tokenizer = AutoTokenizer.from_pretrained('scibert_scivocab_uncased')
model = AutoModelForMaskedLM.from_pretrained('scibert_scivocab_uncased')
The rest of the code is exactly the same as in the previous section; we use the Auto* classes so that the library automatically picks the right model-specific classes from the checkpoint, as sketched below.
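For completeness, here is a minimal end-to-end sketch of reusing that same pipeline around the SciBERT tokenizer and model; the output directory name scibert_mlm is just a placeholder. The data collator masks with whatever mask token the tokenizer defines, so it automatically uses SciBERT's [MASK] instead of RoBERTa's <mask>:

from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Same MLM pipeline as before, now built around the SciBERT tokenizer and model.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='MyData.tsv',
    block_size=128,
)

training_args = TrainingArguments(
    output_dir='scibert_mlm',  # placeholder output directory
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()

# Save the adapted model and its tokenizer for later use.
trainer.save_model('scibert_mlm')
tokenizer.save_pretrained('scibert_mlm')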