LectureBank: a dataset for NLP Education and Prerequisite Chain Learning

Introduction

In this blog post, we introduce our AAAI 2019 accepted paper “What Should I Learn First: Introducing LectureBank for NLP Education and Prerequisite Chain Learning.”
Our LectureBank dataset contains 1,352 English lecture files collected from university courses, mainly in the field of Natural Language Processing (NLP). Each file is manually classified according to an existing taxonomy. Together with the dataset, we include manually labeled prerequisite relations among 208 topics. The dataset is useful for educational purposes such as lecture preparation and organization, as well as for applications such as reading list generation. Additionally, we experiment with neural graph-based networks and non-neural classifiers to learn these prerequisite relations from our dataset.

LectureBank Dataset: resources for NLP!

We collected online lecture files from 60 courses covering 5 domains: NLP, machine learning (ML), artificial intelligence (AI), deep learning (DL), and information retrieval (IR).

Dataset Statistics

We followed a fisheye strategy when collecting the dataset: we focused mainly on NLP and then extended to related fields such as AI and DL. The following table shows detailed statistics about the dataset.
LectureBank Dataset Statistics
For preprocessing, we used the PDFMiner Python package to extract text from the PDF files and python-pptx to extract text from the PowerPoint (PPT) presentations. If a course provided both PDF and PPT versions, we kept the PDF files and removed the PPT files.
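The "keep the PDF, drop the PPT" rule can be sketched in a few lines of Python. This is only an illustration, not our actual preprocessing script: the function name `drop_duplicate_slides`, the example paths, and the assumption that the two versions of a lecture share a file name within the same folder are all hypothetical.

```python
from pathlib import PurePosixPath


def drop_duplicate_slides(paths):
    """Drop a PPT/PPTX file when a PDF with the same folder and base name exists.

    Assumption: both versions of a lecture share a file name and differ
    only in the extension.
    """
    # Paths (without extension) of every lecture that has a PDF version.
    pdf_stems = {
        PurePosixPath(p).with_suffix("")
        for p in paths
        if PurePosixPath(p).suffix.lower() == ".pdf"
    }
    kept = []
    for p in paths:
        path = PurePosixPath(p)
        is_ppt = path.suffix.lower() in {".ppt", ".pptx"}
        if is_ppt and path.with_suffix("") in pdf_stems:
            continue  # a PDF version exists, so skip the PPT
        kept.append(p)
    return kept


# Hypothetical file listing for two courses.
files = ["cs1/lec01.pdf", "cs1/lec01.pptx", "cs1/lec02.ppt", "cs2/lec01.pdf"]
kept_files = drop_duplicate_slides(files)
print(kept_files)
```

A PPT file with no PDF counterpart (here `cs1/lec02.ppt`) is kept, so no lecture is lost.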

Download

For copyright reasons, we are releasing only the links to the individual lectures. We provide a Python script for downloading them all; however, a small number of the links may be invalid because the owners may have changed the URLs.

Prerequisite Chain

Imagine the scenario in the following figure, in which a student has some basic knowledge of NLP but wants to learn a specific new concept such as POS tagging. To fully understand this concept, the student should understand prerequisite concepts such as the Viterbi Algorithm and Markov Models, as well as the prerequisites of these concepts: Dynamic Programming, Bayes Theorem, and Probabilities.
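One way to read this scenario is as a small directed graph from which a valid learning order can be derived. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The exact assignment of the three basic concepts to the Viterbi Algorithm and Markov Models is an assumption made for illustration, and `learning_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# concept -> set of direct prerequisites.
# Assumption: which basic concept feeds which intermediate concept is
# chosen here for illustration only.
prerequisites = {
    "POS Tagging": {"Viterbi Algorithm", "Markov Models"},
    "Viterbi Algorithm": {"Dynamic Programming"},
    "Markov Models": {"Bayes Theorem", "Probabilities"},
}


def learning_order(target, prereqs):
    """Return the target and all of its transitive prerequisites,
    ordered so that every concept comes after its prerequisites."""
    needed, stack = set(), [target]
    while stack:
        concept = stack.pop()
        if concept not in needed:
            needed.add(concept)
            stack.extend(prereqs.get(concept, ()))
    subgraph = {c: prereqs.get(c, set()) for c in needed}
    # TopologicalSorter expects node -> predecessors and yields
    # predecessors first.
    return list(TopologicalSorter(subgraph).static_order())


order = learning_order("POS Tagging", prerequisites)
print(order)
```

Several orders are valid for the same graph; the only guarantee is that each concept appears after everything it depends on, with POS Tagging last.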

Prerequisite Chain Example

Prerequisite Annotation

In addition to using our own dataset for prerequisite chain learning, we make use of a recently introduced corpus of resources on topics related to NLP: Fabbri et al. (2018) introduced a set of 208 topics on NLP and related fields. Our annotators were two Ph.D. students working on NLP. We asked the annotators the following question for each topic pair (A, B): is A a prerequisite of B; i.e., do you think learning the concept A will help one to learn the concept B?
We took the intersection of the two annotators’ annotations, keeping only the pairs that both annotators labeled as prerequisite relations. This resulted in a labeled directed concept graph with 208 concept vertices and 921 edges. If concept A is a prerequisite of concept B, the edge is directed from vertex A to vertex B. The resulting concept graph is shown below.
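Taking the intersection of two annotation sets and turning it into a directed graph can be sketched as follows. The labels in `annotator_1` and `annotator_2` are invented toy data, not the real annotations, and the variable names are hypothetical.

```python
from collections import defaultdict

# Each annotation is a pair (A, B) meaning "A is a prerequisite of B".
# Toy labels for illustration only.
annotator_1 = {
    ("Probabilities", "Bayes Theorem"),
    ("Markov Models", "POS Tagging"),
    ("Dynamic Programming", "Viterbi Algorithm"),
}
annotator_2 = {
    ("Probabilities", "Bayes Theorem"),
    ("Markov Models", "POS Tagging"),
    ("Bayes Theorem", "POS Tagging"),
}

# Keep only the pairs on which both annotators agree.
agreed_edges = annotator_1 & annotator_2

# Adjacency list: prerequisite -> set of concepts that depend on it.
concept_graph = defaultdict(set)
for prerequisite, concept in agreed_edges:
    concept_graph[prerequisite].add(concept)

print(sorted(agreed_edges))
```

Using the intersection favors precision: an edge enters the graph only if both annotators independently judged it to be a prerequisite relation.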

Concept Graph

We observed some cycles between pairs of vertices in the concept graph, that is, pairs in which each concept is labeled as a prerequisite of the other. We found 12 such pairs in our labeled concept graph. These pairs consist of very closely related topics, such as (Domain Adaptation, Transfer Learning) and (LDA, Topic Modeling), suggesting that in the future we may combine each pair into a single concept. There are also 7 independent topics that have no prerequisite relationship with any other topic: Morphological Disambiguation, Weakly-supervised Learning, Multi-task Learning, ImageNet, Human-robot Interaction, Game Playing in AI, and Data Structures and Algorithms.
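Both checks are simple to run on an edge list. The sketch below is a minimal illustration on a toy graph, not the code we used; `mutual_pairs` and `isolated_vertices` are hypothetical helper names, and the edges are invented apart from the (LDA, Topic Modeling) pair mentioned above.

```python
def mutual_pairs(edges):
    """Return unordered pairs {A, B} where both A -> B and B -> A exist."""
    edge_set = set(edges)
    pairs = {
        tuple(sorted((a, b)))
        for a, b in edge_set
        if a != b and (b, a) in edge_set
    }
    return sorted(pairs)


def isolated_vertices(vertices, edges):
    """Return vertices that appear in no edge at all."""
    touched = {v for edge in edges for v in edge}
    return sorted(set(vertices) - touched)


# Toy graph for illustration only.
toy_vertices = ["LDA", "Topic Modeling", "Probabilities", "ImageNet"]
toy_edges = [
    ("LDA", "Topic Modeling"),
    ("Topic Modeling", "LDA"),
    ("Probabilities", "LDA"),
]

cycle_pairs = mutual_pairs(toy_edges)
independent = isolated_vertices(toy_vertices, toy_edges)
print(cycle_pairs, independent)
```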

We also list the concept vertices with the largest in-degree and out-degree in the following figure. A large in-degree means that the concept has many prerequisite concepts; a large out-degree means that the concept is a prerequisite of many other concepts. Concepts with a large in-degree are advanced concepts that require much background knowledge to learn well, while concepts with a large out-degree are more fundamental.
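In-degree and out-degree can be counted directly from the edge list. The following is a small sketch on invented toy edges (each pair (A, B) again means "A is a prerequisite of B"); it does not reproduce the real degree rankings.

```python
from collections import Counter

# Toy edges for illustration only.
toy_edges = [
    ("Probabilities", "Bayes Theorem"),
    ("Probabilities", "Markov Models"),
    ("Probabilities", "Language Modeling"),
    ("Markov Models", "POS Tagging"),
    ("Viterbi Algorithm", "POS Tagging"),
    ("Dynamic Programming", "Viterbi Algorithm"),
]

# Out-degree: how many concepts this concept is a prerequisite of.
out_degree = Counter(prerequisite for prerequisite, _ in toy_edges)
# In-degree: how many prerequisites this concept has.
in_degree = Counter(concept for _, concept in toy_edges)

most_fundamental = out_degree.most_common(1)[0]
most_advanced = in_degree.most_common(1)[0]
print(most_fundamental, most_advanced)
```

In this toy graph, Probabilities has the largest out-degree (a fundamental concept) and POS Tagging has the largest in-degree (an advanced one).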

We also found the longest path in the constructed concept graph, which consists of 14 concepts: Matrix Multiplication, Differential Calculus, Backpropagation, Backpropagation Through Time, Artificial Neural Network, Word Embeddings, Word2Vec, Seq2Seq, Neural Machine Translation, BLEU, IBM Translation Models, ROUGE, Automatic Summarization, Scientific Article Summarization.
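For readers who want to try this on their own graph, here is one possible way to find a longest path, using memoized depth-first search. It assumes the graph is acyclic (for example, after the mutually prerequisite pairs have been merged or removed), and it is a sketch rather than the procedure used in the paper. The function name `longest_path` is hypothetical, and the toy edges are the first steps of the path above plus one invented shortcut edge.

```python
from functools import lru_cache


def longest_path(edges):
    """Return one longest path (as a list of vertices) in an acyclic graph."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)
        successors.setdefault(b, [])

    @lru_cache(maxsize=None)
    def best_from(vertex):
        # Longest path that starts at `vertex`, stored as a tuple.
        best = (vertex,)
        for nxt in successors[vertex]:
            candidate = (vertex,) + best_from(nxt)
            if len(candidate) > len(best):
                best = candidate
        return best

    return list(max((best_from(v) for v in successors), key=len, default=()))


# Toy edges: the start of the path above plus a hypothetical shortcut.
toy_edges = [
    ("Matrix Multiplication", "Differential Calculus"),
    ("Differential Calculus", "Backpropagation"),
    ("Backpropagation", "Backpropagation Through Time"),
    ("Matrix Multiplication", "Backpropagation"),  # shortcut, illustration only
]

path = longest_path(toy_edges)
print(path)
```

On a graph that still contains cycles this recursion would not terminate, which is why the acyclicity assumption matters.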

Citation Credit

Please cite our paper if you want to use our LectureBank dataset:

@article{li2018should,
title={What Should I Learn First: Introducing LectureBank for NLP Education and Prerequisite Chain Learning},
author={Li, Irene and Fabbri, Alexander R and Tung, Robert R and Radev, Dragomir R},
journal={arXiv preprint arXiv:1811.12181},
year={2018}
}
