ELMo: Deep contextualized word representations
In this post, I show how to use pre-trained ELMo embeddings and how to train your own.
Features?
- Pre-trained Embeddings from Language Models. Structure: a character-based CNN followed by bidirectional LSTM layers (any number; 2 is typical).
- Context-dependent word representations. The same word can get different embeddings, depending on the sentence and context it appears in.
- Improved results on supervised NLP tasks including question answering, coreference, semantic role labeling, classification, and syntactic parsing. It came before BERT. But for BERT you may need TPUs… ELMo is often enough for your own corpora. lol
Structure?
A nice illustration from this blog. It has two Bi-LSTM layers on top of a lower-level word embedding layer.
How to use?
There are many ways for you to use the pre-trained embeddings (from the previous figure).
- Collapse into R-dim. You can concatenate all three layers into one very long vector.
- Learn task-specific weights. In the original paper, they learned a weight for each of the three layers, then applied a weighted sum of the layers.
- Simplest: use only the top layer.
- Train your own embedding and concatenate it with the ELMo embedding. If you have your own corpus, you can train your own ‘static’ word embeddings, such as word2vec, and then concatenate the ‘static’ vector with the ELMo one.
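To make the four options above concrete, here is a minimal sketch in plain Python. The layer values, the 2-dim `static_vec` and all function names are made up for illustration; real ELMo layers are much wider. The weighted sum follows the form in the ELMo paper, where the per-layer weights are softmax-normalized and then scaled by a scalar gamma.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def concat_layers(layers):
    """Option 1: concatenate all layer vectors into one long vector."""
    return [x for layer in layers for x in layer]

def weighted_sum(layers, raw_weights, gamma=1.0):
    """Option 2: gamma * sum_j softmax(raw_weights)[j] * layers[j]."""
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [
        gamma * sum(s[j] * layers[j][i] for j in range(len(layers)))
        for i in range(dim)
    ]

def top_layer(layers):
    """Option 3: keep only the top layer."""
    return layers[-1]

# Toy 4-dim outputs for ONE token (made-up numbers).
layers = [
    [1.0, 0.0, 2.0, 0.0],  # layer 0: char-CNN word embedding
    [0.0, 1.0, 0.0, 2.0],  # layer 1: first Bi-LSTM
    [2.0, 2.0, 2.0, 2.0],  # layer 2: second Bi-LSTM
]
static_vec = [0.5, -0.5]   # hypothetical word2vec vector for the same token

concat = concat_layers(layers)                 # 3 * 4 = 12 values
mixed = weighted_sum(layers, [0.0, 0.0, 0.0])  # equal raw weights = plain average
top = top_layer(layers)
combined = static_vec + top                    # option 4: static + ELMo
```

In a real model the raw weights and gamma would be trained together with the downstream task; here they are fixed just to show the arithmetic.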
Train your own ELMo?
There are various methods and links out there, but the following is the route I took. The training code is in TensorFlow and needs CUDA. When you install TF with GPU support, check that your CUDA version is compatible with your TF version. Once the installation is complete, you need to prepare the following files for training:
– A vocabulary file.
– Training file: each row contains a raw sentence.
– Validation file: same format as the training file.
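As a rough sketch of how the vocabulary file could be built from the training sentences: this assumes the training code is AllenNLP's bilm-tf (the post does not name it), which, as far as I recall from its README, expects one token per line, the special tokens `<S>`, `</S>` and `<UNK>` first, and the remaining tokens sorted by descending frequency. Please check the README of the version you use. The function names and the `min_count` parameter are my own, and whitespace tokenization is an assumption.

```python
from collections import Counter

# Special tokens that, under the bilm-tf assumption, come first in the file.
SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]

def build_vocab(sentences, min_count=1):
    """Return special tokens followed by corpus tokens, most frequent first."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    tokens = [
        tok
        for tok, count in counts.most_common()  # sorted by descending count
        if count >= min_count and tok not in SPECIAL_TOKENS
    ]
    return SPECIAL_TOKENS + tokens

def write_vocab(vocab, path):
    """Write one token per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")

# Tiny pre-tokenized corpus, one sentence per row, like the training file.
sentences = ["the cat sat on the mat .", "the dog sat ."]
vocab = build_vocab(sentences)
```

Calling `write_vocab(vocab, "vocab.txt")` would then produce the file; the file name is only an example.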
Once training starts, it is essential to keep an eye on the printed perplexity. Stopping when it is at 40 or a bit above is fine. Remember to validate the model.
I trained on ~2M sentences, which took under 24 hours on 5 GPUs. My training perplexity was 40+, and on the validation file it was 30+.
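For reference, perplexity is the exponential of the average negative log-likelihood per token. Here is a minimal sketch with made-up log-probabilities; it shows the standard definition for one direction of a language model, not the exact code the training script runs.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# If the model gives every token probability 1/32, the perplexity is 32:
# the model is as unsure as if it picked uniformly among 32 words.
log_probs = [math.log(1.0 / 32.0)] * 5
ppl = perplexity(log_probs)
```

So a perplexity of 30 to 40 roughly means the model is, on average, choosing among 30 to 40 equally likely next words.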
If you then want to load the embeddings in PyTorch, you will need to convert the trained weights into an .hdf5 file.
Code?
Simple test code from GitHub.
Hello,
I would like to ask whether the 30+ is the AVERAGE PERPLEXITY?
Yes, I think so.