Resources for BioNLP: datasets and tools

Corpora for general medical texts

Open Research Corpus

Over 39 million published research papers in Computer Science, Neuroscience, and Biomedicine.
The full dataset is 36 GB and is not access-restricted.


PubMed

PubMed comprises more than 29 million citations for biomedical literature from MEDLINE, life science journals, and online books. Many papers use PubMed articles to pre-train word embeddings, while others use them for long-document summarization (treating the abstracts as reference summaries).


MIMIC-III

MIMIC-III is a freely accessible critical care database, but you will need to complete an online training course and obtain the certificate to apply for access. For NLP research, the table NOTEEVENTS contains all notes for patients; these notes are de-identified, with names and dates masked before release. Like PubMed, it can be used to pre-train word embeddings.


Clinical Concept Embeddings

The link lets you interact with embeddings for over 108,000 medical concepts. These embeddings were created using insurance claims for 60 million Americans, 1.7 million full-text PubMed articles, and clinical notes from 20 million patients at Stanford. Check their paper Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data.

Medicine Graph

EHRs represent a rich and relatively untapped resource for characterizing the true nature of clinical practice and for quantifying the degree of inter-relatedness of medical entities such as drugs, diseases, procedures, and devices. In the paper, the authors provide a unique set of co-occurrence matrices, quantifying the pairwise mentions of 3 million terms mapped onto 1 million clinical concepts, calculated from the raw text of 20 million clinical notes spanning 19 years of data. This dataset can be leveraged to quantitatively assess comorbidity, drug-drug, and drug-disease patterns for a range of clinical, epidemiological, and financial applications.
Data download.
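The core idea behind such co-occurrence matrices can be illustrated with a toy version: count how often two concepts are mentioned in the same note. A pure-Python sketch (the concept lists are invented):

```python
from collections import Counter
from itertools import combinations

# Each note is reduced to the set of clinical concepts mentioned in it (toy data).
notes = [
    {"diabetes", "metformin", "hypertension"},
    {"diabetes", "metformin"},
    {"hypertension", "lisinopril"},
]

cooc = Counter()
for concepts in notes:
    # Count each unordered pair of concepts once per note.
    for a, b in combinations(sorted(concepts), 2):
        cooc[(a, b)] += 1

print(cooc[("diabetes", "metformin")])  # 2
```

The released matrices are of course far larger and built from normalized concept mentions, but the pairwise counting structure is the same.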

Corpora for specific NLP tasks

BioASQ Challenge Data on biomedical semantic indexing and question answering

The challenges include tasks relevant to hierarchical text classification, machine learning, information retrieval, question answering from texts and structured data, multi-document summarization, and many other areas. They also released word embeddings trained on the PubMed dataset.

MedNLI for Natural Language Inference (NLI) in clinical Domain

Natural Language Inference (NLI) is one of the critical tasks for understanding natural language. The objective of NLI is to determine whether a given hypothesis can be inferred from a given premise. MedNLI is a dataset designed for the NLI task in the clinical domain. It contains 14,049 unique sentence pairs annotated by 4 clinicians over a period of six weeks. You need access to MIMIC-III first before you can download MedNLI. Code is available here. For more details, please check the paper Lessons from Natural Language Inference in the Clinical Domain.
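A hedged sketch of reading such premise/hypothesis pairs with the standard library, assuming the SNLI-style JSON-lines convention of `sentence1`/`sentence2`/`gold_label` fields (check the downloaded files for the exact field names):

```python
import json
import io

# Toy stand-in for one line of an NLI .jsonl file (SNLI-style fields assumed).
raw = io.StringIO(
    '{"sentence1": "The patient has a history of diabetes.", '
    '"sentence2": "The patient has an endocrine disorder.", '
    '"gold_label": "entailment"}\n'
)

pairs = []
for line in raw:
    ex = json.loads(line)
    pairs.append((ex["sentence1"], ex["sentence2"], ex["gold_label"]))

print(pairs[0][2])  # entailment
```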

Clinical Abbreviation Sense Inventory for medical term disambiguation

In the latest version, a total of 440 of the most frequently used abbreviations and acronyms were selected from 352,267 dictated clinical notes. Their 949 senses were manually annotated from 500 random instances of each abbreviation within clinical notes, and lexically aligned with 17,359 long forms in the Unified Medical Language System (UMLS), 5,233 long forms in Another Database of Abbreviations in Medline (ADAM), and 4,879 long forms in Stedman’s Medical Abbreviations, Acronyms & Symbols (4th edition).
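As a toy illustration of the disambiguation task this sense inventory supports, one simple baseline picks the sense whose cue words overlap most with the surrounding context. The abbreviation, senses, and cue words below are invented, not taken from the inventory:

```python
# Toy word-sense disambiguation for a clinical abbreviation by context overlap.
# The abbreviation "RA" and its cue words are illustrative, not from the inventory.
senses = {
    "rheumatoid arthritis": {"joint", "swelling", "arthritis", "methotrexate"},
    "right atrium": {"cardiac", "atrium", "echo", "valve"},
}

def disambiguate(context_tokens):
    """Return the sense with the largest cue-word overlap with the context."""
    ctx = set(context_tokens)
    return max(senses, key=lambda s: len(senses[s] & ctx))

note = "echo shows dilated ra with normal valve function".split()
print(disambiguate(note))  # right atrium
```

Real systems built on this inventory use richer features (embeddings, position, section headers), but the labeled senses are what make supervised training possible.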



Tools

Scikit-learn

An open-source Python library providing efficient tools for data mining and data analysis, including methods for classification, regression, clustering, and so on.
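A minimal sketch of the library's text-classification workflow on toy clinical-style snippets (the texts and labels are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: note snippets labeled with an invented topic.
texts = [
    "patient reports chest pain and shortness of breath",
    "ekg shows st elevation chest pain ongoing",
    "blood glucose elevated started on insulin",
    "diabetes poorly controlled glucose high",
]
labels = ["cardiac", "cardiac", "endocrine", "endocrine"]

# Bag-of-words features piped into a linear classifier.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
pred = clf.predict(["crushing chest pain radiating to arm"])[0]
print(pred)
```

The same pipeline pattern works for most of the clinical text classification papers surveyed below, just with more data and stronger features.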

Natural Language Toolkit (NLTK)

It is the leading platform for building Python programs that work with human language data, providing helpful, fundamental functions such as tokenization and topic modeling.
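A quick tokenization sketch with NLTK. This uses `RegexpTokenizer`, which needs no extra model downloads (unlike `word_tokenize`, which requires the "punkt" data):

```python
from nltk.tokenize import RegexpTokenizer

# Regexp-based tokenization: each token is a maximal run of word characters.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize("The patient was given 5 mg of warfarin.")
print(tokens)  # ['The', 'patient', 'was', 'given', '5', 'mg', 'of', 'warfarin']
```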

Stanford CoreNLP

It is a suite of software created by the Stanford NLP Group that focuses on statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems. It provides basic, handy interfaces for NLP tasks such as NER and part-of-speech (POS) tagging.


UMLS

UMLS (Unified Medical Language System) integrates and distributes key terminology, classification and coding standards, and associated resources to promote the creation of more effective and interoperable biomedical information systems and services, including electronic health records. One powerful use of the UMLS is linking health information, medical terms, drug names, and billing codes across different computer systems. The UMLS has many other uses, including search engine retrieval, data mining, public health statistics reporting, and terminology research.


MetaMap

It is a tool for identifying medical concepts in text and mapping them to standard terminologies in the UMLS. MetaMap uses a knowledge-intensive approach based on symbolic natural-language processing (NLP) and computational-linguistic techniques.
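MetaMap itself is far more sophisticated, but the core idea of mapping free text onto a terminology can be illustrated with a toy greedy longest-match dictionary lookup. The mini-dictionary and concept IDs below are invented; the real system uses the UMLS plus linguistic analysis:

```python
# Toy illustration of concept mapping: greedy longest match against a small
# term dictionary. Terms and concept IDs are invented, not real UMLS CUIs.
dictionary = {
    ("chest", "pain"): "C_CHEST_PAIN",
    ("pain",): "C_PAIN",
    ("aspirin",): "C_ASPIRIN",
}
max_len = max(len(term) for term in dictionary)

def map_concepts(tokens):
    found, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first so "chest pain" beats "pain".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in dictionary:
                found.append(dictionary[span])
                i += n
                break
        else:
            i += 1  # no concept starts here; advance one token
    return found

print(map_concepts("patient with chest pain given aspirin".split()))
# ['C_CHEST_PAIN', 'C_ASPIRIN']
```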

Review Papers

Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review
Clinical text classification research trends: Systematic literature review and open issues

More links…

Keep updating!
