MLM, masked language modeling, is an important task for trianing a BERT model. In the orignal BERT paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, it is one of the main tasks of how BERT was pre-trained. So if you have your own corpus, it is possible to train MLM on any pre-trained BERT models, i.e., RoBERTa, SciBert.
Huggingface Library and Input tsv
The Huggingface library supports a various pre-trained BERT models. Now let’s first prepare a tsv file as our courpus, and this would be the input file to train the MLM. Simply, put the free-text in lines, and say we name this file to be
MLM for regular BERT Models
For some regular BERT models, the huggingface library supports some well-defined classes. Now let’s take the RoBERTa model as an example. There are three classes we need to be familar with:
from transformers import RobertaTokenizerFast from transformers import RobertaConfig from transformers import RobertaForMaskedLM tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base") config = RobertaConfig( vocab_size=52_000, max_position_embeddings=514, num_attention_heads=12, num_hidden_layers=6, type_vocab_size=1, ) model = RobertaForMaskedLM(config=config)
tokenizer is associated with RoBERTa model. You will need to change to other classes if not using RoBERTa. Similarly, here we want to initialize a RoBERTa, so we pass the
config to init the
model. Now, to deal with the data loading:
from transformers import DataCollatorForLanguageModeling data_collator = DataCollatorForLanguageModeling( tokenizer=tokenizer, mlm=True, mlm_probability=0.15 ) dataset = LineByLineTextDataset( tokenizer=tokenizer, file_path='MyData.tsv', block_size=128, )
Simply, we provide the data path to
LineByLineTextDataset, so as to load in the data in lines. You might have noticed that in the
data_collator, we assign 0.15 as the probability of MLM, which is the same as the BERT paper. Next, start training and save trained model:
from transformers import Trainer, TrainingArguments training_args = TrainingArguments( output_dir=trained_path, overwrite_output_dir=True, num_train_epochs=1, per_gpu_train_batch_size=16, save_steps=10_000, save_total_limit=2, prediction_loss_only=True, ) trainer = Trainer( model=model, args=training_args, data_collator=data_collator, train_dataset=dataset, ) print ('Start a trainer...') # Start training trainer.train() # Save trainer.save_model('trained_path/') print ('Finished training all...',trained_path)
After training is finished, under
trained_path, you will see the saved model. Next time, you can load in the model for your own downstream tasks.
MLM for special BERT Models
In the library, there are many other BERT models, i.e., SciBERT. Such models don’t have a special Tokenizer class or a Config class, but it is still possible to train MLM on top of those models. Next, let’s take the pre-trained SciBERT as an example, and the following is the way to init it:
from transformers import AutoConfig,AutoTokenizer,AutoModelForMaskedLM config = AutoConfig.from_pretrained('scibert_scivocab_uncased') tokenizer = AutoTokenizer.from_pretrained('scibert_scivocab_uncased') model = AutoModelForMaskedLM.from_pretrained('scibert_scivocab_uncased')
The rest of the code is exacly the same as the previous section. We use
Autoxxx class to let the library to auto-recognize the classes.