==============================================================================
==== The CRAFT (Colorado Richly Annotated Full-Text) Corpus ====
==============================================================================
The contents of this downloaded tarball consist of the CRAFT Corpus v2.0 release.
This release consists of 67 articles from the PubMed Central
Open Access subset. Each article has been annotated both syntactically and
conceptually. For the syntactic annotation of the corpus, every article has
been marked up for sentence segmentation, tokenization, part-of-speech
tagging, and coreference, and a manually curated syntactic parse of each
sentence is available in Penn Treebank format. The concept annotation identifies all
mentions of all concepts from nine prominent biomedical ontologies and
terminologies: the Cell Ontology, the Chemical Entities of Biological
Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence
Ontology, the entries of the Entrez Gene database, and the three subontologies
of the Gene Ontology (i.e., Biological Processes, Cellular Components, and
Molecular Functions).
For details of the concept annotations see:
Concept Annotation in the CRAFT Corpus.
Bada, M.*, Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D.,
Baumgartner Jr., W. A., Cohen, K. B., Verspoor, K., Blake, J. A.,
and Hunter, L. E.
BMC Bioinformatics. 2012 Jul 9;13:161.
doi: 10.1186/1471-2105-13-161
PubMed:22776079
For gene mention and syntactic tool performance over CRAFT see:
A corpus of full-text journal articles is a robust evaluation tool for
revealing differences in performance of biomedical natural language
processing tools.
Verspoor, K.*, Cohen, K.B.*, Lanfranchi, A., Warner, C., Johnson, H.L.,
Roeder, C., Choi, J.D., Funk, C., Malenkiy, Y., Eckert, M., Xue, N.,
Baumgartner Jr., W.A., Bada, M., Palmer, M., Hunter L.E.
BMC Bioinformatics. 2012 Aug 17;13(1):207.
PubMed:22901054
For a detailed overview of the coreference annotations included in CRAFT, see:
Coreference annotation and resolution in the Colorado Richly Annotated Full
Text (CRAFT) corpus of biomedical journal articles.
K. Bretonnel Cohen, Arrick Lanfranchi, Miji Joo-young Choi, Michael Bada,
William A. Baumgartner Jr., Natalya Panteleyeva, Karin Verspoor,
Martha Palmer, Lawrence E. Hunter.
BMC Bioinformatics 2016
The CRAFT Corpus is available for download from:
http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml
==============================================================================
==== Directory Structure ====
==============================================================================
articles/
---------ids/
---------nxml/
---------txt/
coreference/brat
dependency/
genia-xml/
---------pos/
---------term/
knowtator-xml/
ontologies/
protege/
rdf/
treebank/
xmi/
xml/
::: articles/ids/ :::
Contains a file listing the PubMed IDs contained in this distribution
(craft-pmids-release) and a file mapping from PubMed ID to PubMed
Central ID and original downloaded file name for all articles in this
distribution (craft-idmappings-release).
::: articles/nxml/ :::
Contains the original XML for each article in this distribution as
downloaded as part of the PubMed Central Open Access collection.
::: articles/txt/ :::
Contains a plain text version of each article that was derived from the
original XML files. NOTE: Annotation offsets included in this
distribution are relative to the plain-text versions of the articles.
The file name for any given article is its PubMed ID with a ".txt"
extension. All CRAFT articles use UTF-8 encoding. Also included for
each article are files containing the copyright information
([PUBMED_ID].copyright) and the article's references
([PUBMED_ID].references).
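
As a minimal sketch of how the plain-text files relate to annotation
offsets, the following Python reads an article as UTF-8 and slices out a
span. The helper names and the directory argument are hypothetical, and the
end-exclusive reading of an offset pair is an assumption; only the
[PUBMED_ID].txt naming and the UTF-8 encoding come from this README.

```python
import os


def load_article_text(txt_dir, pubmed_id):
    """Read articles/txt/[PUBMED_ID].txt as UTF-8 (hypothetical helper)."""
    path = os.path.join(txt_dir, f"{pubmed_id}.txt")
    # newline="" disables newline translation so that character offsets
    # into the returned string match the file contents exactly.
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def covered_text(text, start, end):
    """Return the text covered by one (start, end) offset pair.

    Assumption: offsets are end-exclusive. Python strings are indexed by
    Unicode code point, so code-point-based offsets can be used directly.
    """
    return text[start:end]
```

Comparing the result of covered_text() with the covered text recorded in an
annotation file is a quick way to check that offsets are being interpreted
as intended.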
::: coreference/brat :::
Contains the coreference annotations for CRAFT serialized using the BRAT
annotation format. For details see: http://brat.nlplab.org/
::: dependency/ :::
Contains dependency parse trees for each sentence of every article that is
part of this distribution.
::: genia-xml/pos :::
Contains files showing sentence, token, and part-of-speech information in
the GENIA-style POS embedded XML format as defined in
http://www-tsujii.is.s.u-tokyo.ac.jp/~jdkim/publications/GENIA_Corpus_Manual.pdf.
::: genia-xml/term :::
Contains files that represent the CRAFT concept annotations in the
GENIA-style TERM embedded XML format as defined in
http://www-tsujii.is.s.u-tokyo.ac.jp/~jdkim/publications/GENIA_Corpus_Manual.pdf.
NOTE: The GENIA-style embedded XML format is unable to handle multi-span
annotations. FOR THIS REASON, THE REPRESENTATION OF THE CRAFT CONCEPT
ANNOTATIONS IN THIS FORMAT IS INCOMPLETE. All split-span annotations have
been excluded from this output format. Missing annotations are documented
in files ending with ".excluded_annotations". The annotation format used
is the following: class [tab] span(s) [tab] covered_text.
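
A minimal sketch of reading one line of an ".excluded_annotations" file,
assuming only the three tab-separated fields described above. The names
ExcludedAnnotation and parse_excluded_line are hypothetical, and the span
field is kept as a raw string because its exact serialization is not
specified here.

```python
from typing import NamedTuple


class ExcludedAnnotation(NamedTuple):
    concept_class: str
    spans: str         # raw span(s) field, left unparsed on purpose
    covered_text: str


def parse_excluded_line(line):
    """Split 'class<TAB>span(s)<TAB>covered_text' into its three fields."""
    fields = line.rstrip("\r\n").split("\t", 2)
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return ExcludedAnnotation(*fields)
```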
::: knowtator-xml/ :::
Contains XML stand-off annotation in the Knowtator XML output format
for all concept and coreference annotations for every article in this distribution.
Before using this data outside of the Knowtator application, please
see the section on "Annotation Offsets and Java."
::: ontologies/ :::
Contains the original ontology files used for concept annotation.
All files are in OBO format.
::: protege/ :::
Contains Protege files for loading a project containing all concept
and coreference annotations using the Knowtator plugin.
::: rdf/ :::
Contains an AO RDF (http://code.google.com/p/annotation-ontology/)
representation of the documents and annotations for the articles that
are part of this distribution.
::: treebank/ :::
Contains the full syntactic parse trees in Penn Treebank style for each
sentence in every article that is part of this distribution.
::: xmi/ :::
Contains serialized UIMA (Unstructured Information Management
Architecture; http://uima.apache.org/) CAS files using the UIMA XMI
format. For the semantic concepts, the CCP type system (included) has been
used. The ClearTK (http://code.google.com/p/cleartk/) type system has been
used to represent the syntactic types as well as coreference annotations.
See also the section on "Annotation Offsets and Java" below.
::: xml/ :::
Contains XML stand-off annotation in the Knowtator XML output format
for all concept annotations for every article in this distribution.
The offsets supplied in these files are based on Unicode code points,
and thus may differ from the offsets found in the knowtator-xml/ files.
Please see the section on "Annotation Offsets and Java" for clarification.
==============================================================================
==== Character Encoding ====
==============================================================================
UTF-8 encoding is used throughout the CRAFT project, so please default to
UTF-8 when using CRAFT resources.
==============================================================================
==== Annotation Offsets and Java ====
==============================================================================
******* NOTE: CHARACTER OFFSETS FOR SOME FORMATS ARE SPECIFIC TO JAVA *******
A number of the offset-annotation output formats (UIMA XMI and Knowtator XML)
are specifically geared for use by applications coded in Java. These files
have been produced by Java applications, and because of the way Java encodes
supplementary Unicode code points, the character offsets defined in these
files may be incorrect if used outside of a Java environment. The reason for
this potential discrepancy is that Java represents each supplementary code
point (above U+FFFF) as a surrogate pair of two char primitives.
For details, please see:
http://download.oracle.com/javase/6/docs/api/java/lang/Character.html
The distribution includes stand-off annotations in the xml/ directory that
use Unicode code points as offsets. If using CRAFT in a non-Java environment,
use of the files in the xml/ directory is recommended.
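
To illustrate the discrepancy, here is a minimal Python sketch that converts
an offset counted in Java chars (UTF-16 code units) into a Unicode code point
offset. The function name is hypothetical; it is meant as an illustration of
the offset arithmetic, not as a replacement for the files in the xml/
directory.

```python
def utf16_to_codepoint_offset(text, utf16_offset):
    """Convert a Java char (UTF-16 code unit) offset to a code point offset."""
    units = 0
    for index, ch in enumerate(text):
        if units == utf16_offset:
            return index
        if units > utf16_offset:
            raise ValueError("offset falls inside a surrogate pair")
        # Supplementary code points occupy two UTF-16 code units in Java.
        units += 2 if ord(ch) > 0xFFFF else 1
    if units == utf16_offset:
        return len(text)
    raise ValueError("offset is past the end of the text or inside a surrogate pair")
```

For the string "a\U0001D6FCb", Java places the final "b" at char offset 3,
whereas its code point offset is 2. The two kinds of offset agree for any
text that contains no supplementary code points.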
==============================================================================
==== Formatting ====
==============================================================================
The CRAFT Corpus has been made available in a number of different formats.
Availability of the semantic and syntactic annotations in the various
formats is detailed in the table below.
---------------------------------------------------------------
| Format          | Concepts | Syntactic              | Coref |
---------------------------------------------------------------
| AO RDF          | X        |                        |       |
| BRAT            |          |                        | X     |
| CoNLL Dep. Tree |          | Dependency parse trees |       |
| GENIA XML       | X*       | Sentences, tokens, POS |       |
| Knowtator XML   | X        |                        | X     |
| Penn TreeBank   |          | Full syntactic trees   |       |
| Protege         | X        |                        | X     |
| UIMA XMI        | X        | Full syntactic trees   |       |
| XML             | X        |                        | X     |
---------------------------------------------------------------
* Indicates incomplete representation of the CRAFT Corpus
::: AO RDF :::
CRAFT semantic annotations have been represented in RDF/XML, making
extensive use of the Annotation Ontology
(http://code.google.com/p/annotation-ontology/). The RDF format
references versions of the documents on the PubMed Central Web site,
e.g., http://www.ncbi.nlm.nih.gov/pmc/articles/PMC138691/?tool=pubmed.
Annotations are defined using the Annotation Ontology
PrefixPostfixTextSelector paradigm
(see http://code.google.com/p/annotation-ontology/wiki/Selectors).
::: BRAT :::
The standoff annotation format used by the BRAT annotation tool.
For details see: http://brat.nlplab.org/
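
A minimal sketch of parsing one text-bound ("T") line of BRAT standoff,
including discontinuous spans separated by ";". The function name is
hypothetical, and which other BRAT annotation kinds (relations, equivalences,
and so on) the CRAFT coreference files use is not stated here, so this
handles text-bound lines only.

```python
def parse_brat_textbound(line):
    """Parse 'ID<TAB>TYPE START END[;START END...]<TAB>TEXT'."""
    ann_id, type_and_spans, text = line.rstrip("\r\n").split("\t", 2)
    if not ann_id.startswith("T"):
        raise ValueError("not a text-bound annotation line")
    ann_type, span_part = type_and_spans.split(" ", 1)
    # Each fragment is 'START END'; discontinuous annotations have several.
    spans = [tuple(int(n) for n in frag.split()) for frag in span_part.split(";")]
    return ann_id, ann_type, spans, text
```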
::: CoNLL Dependency Tree :::
Dependency parse trees were generated with the CLEAR Parser
(http://code.google.com/p/clearparser/) using the CRAFT Treebank data
as input. These parse trees have not been manually vetted.
::: GENIA XML :::
CRAFT concept annotations have been represented in the GENIA-style
Embedded XML format. All formats comply with the format definitions
described in http://www-tsujii.is.s.u-tokyo.ac.jp/~jdkim/publications/GENIA_Corpus_Manual.pdf.
Available GENIA-style embedded XML formats include part-of-speech tags and
concept annotation. NOTE: Overlapping split-span annotations cannot
be represented in the GENIA embedded XML format; thus, a number of
annotations are excluded from this format. Such cases are logged in the
accompanying ".excluded_annotations" files.
::: Knowtator XML :::
The XML output format produced by the Knowtator application
(http://knowtator.sourceforge.net/). This format can be easily imported
into Knowtator projects. Annotation offsets in this format are
relative to the plain text versions of the articles that can be found
in the articles/txt directory. Before making use of this format, please
read the "Annotation Offsets and Java" section.
::: Penn TreeBank :::
Full syntactic parse trees in the Penn Treebank style.
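
A minimal sketch of reading one Penn Treebank-style bracketed tree and
listing its (POS tag, token) leaves. The function names are hypothetical, the
layout of the files in treebank/ (for example, one or several trees per
file) is not described here, and skipping empty elements tagged -NONE- is a
choice made for this illustration.

```python
import re

_TOKEN = re.compile(r"\(|\)|[^\s()]+")


def read_tree(bracketed):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = _TOKEN.findall(bracketed)
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ')'
        return node

    return parse()


def leaves(node):
    """Yield (pos_tag, token) pairs, skipping empty elements (-NONE-)."""
    if len(node) == 2 and isinstance(node[1], str):
        if node[0] != "-NONE-":
            yield node[0], node[1]
        return
    for child in node:
        if isinstance(child, list):
            yield from leaves(child)
```

Usage: list(leaves(read_tree(tree_string))) gives the tagged tokens of one
sentence in order.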
::: UIMA XMI :::
Versions of a UIMA CAS data structure representing each article and its
annotations have been serialized using the UIMA XMI data format.
Concept annotations are represented using the CCP Type System
(CCPTypeSystem.xml; included in this distribution) while Treebank data
is represented using the ClearTK (http://code.google.com/p/cleartk/) type
system.
::: XML :::
XML standoff annotation of the CRAFT concept annotations in the same
format as the Knowtator XML; however, the offsets in these files are to
Unicode code points (and not Java characters).
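
A minimal sketch of pulling spans and class identifiers out of a
Knowtator-style XML document in Python. The element and attribute names used
here (annotation, mention, span with start/end, classMention, mentionClass)
are assumptions based on typical Knowtator XML output and are not specified
in this README, so check them against an actual file before relying on this.
The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_knowtator_spans(xml_text):
    """Return a list of (mention_id, [(start, end), ...], class_id) tuples."""
    root = ET.fromstring(xml_text)
    # Assumed layout: <classMention id="M"><mentionClass id="CLASS">...
    classes = {
        cm.get("id"): cm.find("mentionClass").get("id")
        for cm in root.iter("classMention")
    }
    result = []
    for ann in root.iter("annotation"):
        mention_id = ann.find("mention").get("id")
        # An annotation may carry several <span> elements (split spans).
        spans = [(int(s.get("start")), int(s.get("end"))) for s in ann.findall("span")]
        result.append((mention_id, spans, classes.get(mention_id)))
    return result
```

Because Python indexes strings by code point, offsets read from the xml/
files this way can be applied directly to the text of the matching file in
articles/txt.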
==============================================================================
==== Browsing CRAFT ====
==============================================================================
::: Browse via BRAT :::
The CRAFT Corpus is available for browsing via the brat rapid annotation
tool (http://brat.nlplab.org/) here:
http://compbio.ucdenver.edu/Craft/index.xhtml
Hover over the annotations to see their class information. Detailed usage
instructions are available here: http://brat.nlplab.org/manual.html
Note that this BRAT installation is configured to be read-only (no editing
is permitted).
Note that BRAT cannot display discontinuous annotations, so all
discontinuous annotations are shown with their entire span. When hovering
over a discontinuous annotation, the discontinuous text is displayed,
e.g. "region on the .. chromosome". For an example see the following
Sequence Ontology annotation:
http://compbio.ucdenver.edu/Craft/index.xhtml#/16700629?focus=T124.
::: Browse via DOMEO :::
CRAFT will be available for browsing in the DOMEO annotation toolkit in the
coming months. (http://annotationframework.org/)
::: Protege :::
1) Download and install Protege v3.3.1 and Knowtator v1.7.4
- while there are newer versions of Knowtator, the annotation was done
using v1.7.4 so that is the recommended version for browsing the
CRAFT corpus.
- follow instructions 1-6 here:
http://knowtator.sourceforge.net/install.shtml
2) Increase the memory available to Protege to at least 2G
- follow instructions under Knowtator/Protege runs out of memory here:
http://knowtator.sourceforge.net/trouble_shooting.shtml
3) Start Protege
4) From the File menu, select Open, then select the
protege/craft.pprj file that came with this distribution
5) If the Knowtator tab is not present, Project --> Configure, then
check the box next to Knowtator and click OK
6) Click on the "Knowtator" tab if it is not already selected and the
articles should appear with annotations overlaid
::: Use at U-Compare.org :::
CRAFT will soon be available for use in U-Compare (http://u-compare.org/).
==============================================================================
==== Feedback ====
==============================================================================
Please direct comments, questions, and suggestions to the CRAFT email
list on SourceForge:
https://sourceforge.net/mailarchive/forum.php?forum_name=bionlp-corpora-craft