Using GATE

First, download GATE. It is recommended that you use either the Windows or Mac OS X installer or the generic installer (an executable jar file). Run the installer and follow the on-screen instructions. The remainder of this page uses the symbol $GATE_HOME to refer to the directory where you installed GATE.

GATE requires that you have Java installed. NOTE: Most Windows computers do not come with Java pre-installed. If you are not sure which version of Java is installed on your computer, open a command prompt (shell) and type the command:

java -version

If you get a "command not found" error, or if the installed version of Java is 1.4 or older, you will need to download the latest Java version from Sun Microsystems. Be sure to download the Java JDK and not the Java JRE.
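The version check described above can also be scripted. The following is a minimal sketch, not part of GATE or the ANC tools: the function names are hypothetical, and the regular expression assumes the usual `version "x.y..."` line that java -version prints (normally on standard error).

```python
import re
import subprocess


def parse_java_version(output):
    """Return (major, minor) from `java -version` output, or None if absent."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2) or 0))


def is_too_old(version):
    """This page treats Java 1.4 or older as too old for GATE."""
    return version <= (1, 4)


def installed_java_version():
    """Run `java -version`; return None if Java is missing or unparseable."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        # Equivalent to the "command not found" case described above.
        return None
    # The version banner is normally written to stderr, so check both streams.
    return parse_java_version(result.stderr + result.stdout)


if __name__ == "__main__":
    version = installed_java_version()
    if version is None or is_too_old(version):
        print("Java is missing or too old; install a newer JDK.")
    else:
        print("Found Java version", version)
```

Versions are compared as (major, minor) tuples, so both the older "1.x" numbering and the newer single-number scheme compare correctly against 1.4.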

Installing the GrAF tools for GATE

The ANC now maintains a GATE plugin repository. See the GATE documentation for instructions on installing plugins from a plugin repository. The ANC’s repository is located at http://www.anc.org/tools/gate/gate-update-site.xml

More information on the GATE tools can be found here.

Annotation Types

There are several annotation types distributed with OANC and MASC data. All data have the following annotations:

Type     Description
logical  The logical structure of the document down to the paragraph level. These annotations are required to convert the document into a well-formed XML file.
s        Sentence boundary annotations
penn     Penn part-of-speech tags generated by a modified version of the ANNIE POS Tagger in GATE
nc       Noun chunks
vc       Verb chunks
ne       Named entities

The type indicator associated with each annotation type is an id defined in the resource header distributed with each corpus. It is used in file names, annotation spaces, and elsewhere in the corpus.

Loading a GrAF Document

Method 1 (preferred)

Select GrAF Document under New Language Resource. The new document dialog will open. Fill in the following fields:

Click the OK button to close the dialog and open the document. Check the Message tab for error messages. If all went well, you should see the document listed in the left-hand pane under Language Resources (you may have to expand the Language Resources tree).

Method 2 (alternate)

If, for some reason, the above method does not work, you can load the text file for the document directly. Create a GATE Document (File -> New language resource -> Gate document) and select the text file (with the .txt extension) as the sourceUrl. The New Gate document dialog is similar to the new GrAF document dialog, minus the fields for the standoff annotations, and you fill it in the same way. However, you must specify the encoding as UTF-8. Once the text has been loaded, follow the instructions below to load any standoff annotations.
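Because the encoding must be given as UTF-8, it can be useful to confirm that a text file really decodes as UTF-8 before opening it in GATE. The sketch below is a standalone check with hypothetical function names; it does not use or reproduce anything from GATE itself.

```python
def decodes_as_utf8(data):
    """Return True if the given bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path):
    """Read a file in binary mode and test whether it is valid UTF-8."""
    with open(path, "rb") as handle:
        return decodes_as_utf8(handle.read())
```

A file that fails this check was probably saved in another encoding (for example Latin-1) and would show garbled characters if loaded as UTF-8.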

Loading Additional Standoff Annotations

Before loading additional standoff annotations, some initial setup is required. This typically only has to be done once.

1. Create a Processing Resource (PR). Select New Processing Resource -> GrAF Load Standoff to load an individual annotation file, or select New Processing Resource -> GrAF Load All Standoff to load all standoff annotations for a document. Enter a name for the PR (for example, Load Standoff) and click the OK button. The Load Standoff (or Load All Standoff) PR should appear in the left-hand panel under Processing Resources.

2. Create a GATE Application.

3. Configure and run the application.

  1. Double-click on the Load Standoff application in the left panel. This will open the Application Editor in the main window.
  2. Select the Load Standoff PR in the list of Loaded Processing Resources and click on the right arrow to move it to the list of Selected Processing Resources.
  3. Click on the Load Standoff PR in the list of Selected Processing Resources to open the PR parameter editor in the bottom of the main window.
  4. Select the document (or corpus, if applicable) you would like to add the standoff annotations to.
  5. In the annotationType box, enter the annotation type that you want to load. Note that annotation files that are required for another annotation type are automatically loaded when the dependent annotation type is given. For example, the ptb (Penn Treebank Syntax) annotations require the ptbtok (Penn Treebank tokens with part of speech) annotations; if ptb is specified as the annotation type to load, ptbtok annotations are automatically loaded as well.
  6. Click the folder icon next to the sourceUrl field and navigate to the standoff annotation file you would like to add. Unlike opening a GrAF Document, where you select the .hdr header file, when loading standoff annotations separately you must select the XML standoff annotation file directly. Each annotation file name includes a -xx.xml suffix, where "xx" (which may be longer than two characters) is the annotation type indicator for the contained annotation type.
  7. Set the standoffASName (optional). This is the name of the annotation set that the standoff annotations will be added to. In the image below, the tokens with lemma and Penn part of speech tags annotations (-penn.xml suffix) are added to the Standoff markups annotation set. If an annotation set with the specified name already exists, the new annotations will be added to the existing set; otherwise a new annotation set with that name will be created.
  8. Steps 4 – 7 can be repeated as desired.
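The file naming and dependency rules in steps 5 and 6 can be illustrated with a short sketch. This is not the plugin's code: the function names are hypothetical, the dependency table contains only the ptb/ptbtok pair mentioned on this page, and loading dependencies first is an assumption about ordering.

```python
# Only the dependency named on this page; the real plugin may define more.
REQUIRES = {"ptb": ["ptbtok"]}


def annotation_type(file_name, document_name):
    """Extract the type indicator "xx" from a "<document>-xx.xml" file name.

    Returns None if the file name does not follow that pattern.
    """
    prefix = document_name + "-"
    suffix = ".xml"
    if not (file_name.startswith(prefix) and file_name.endswith(suffix)):
        return None
    indicator = file_name[len(prefix):-len(suffix)]
    return indicator or None


def types_to_load(requested):
    """Return the requested type preceded by any types it depends on."""
    ordered = []

    def visit(name):
        for dependency in REQUIRES.get(name, []):
            visit(dependency)
        if name not in ordered:
            ordered.append(name)

    visit(requested)
    return ordered
```

For a document named bartok, annotation_type("bartok-penn.xml", "bartok") yields "penn", and types_to_load("ptb") yields ["ptbtok", "ptb"].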

Saving GrAF Standoff Annotations

The procedure for saving standoff annotations in GrAF format is the same as the procedure for loading standoff annotations:

  1. Create a processing resource. This time you will create a GrAF Save Standoff processing resource.
  2. Create a GATE application and add the Save Standoff PR to the new application.
  3. Configure the processing resource and run the application. The following fields need to be completed in the Save Standoff processing resource:
  4. annotationType – Enter the annotation type that you want to save. The name in this field will appear as a suffix to the name of the resulting GrAF XML file. In the example below, the resulting filename will be bartok-standoff.xml.
  5. destination – This is the location where the standoff annotation file will be saved. If the file does not exist, navigate to and choose the folder in which you want the file to be written. Otherwise, choose an existing file as the destination.
  6. document – Name of the document containing the annotations to be saved. The drop down box will contain a list of all the open documents.
  7. inputASName – Name of the annotation set containing the annotations to be saved.
  8. standoffTags – A list of the annotations that will be saved. If left blank all the annotations in the selected annotation set will be saved.
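The effect of the annotationType and standoffTags fields can be sketched as follows. This is an illustration only, with hypothetical function names; it assumes the output name is the document name plus the annotation type, as in the bartok-standoff.xml example, and it models the case-sensitive matching of tag names noted in the tips on this page.

```python
def standoff_file_name(document_name, annotation_type):
    """Build the output file name, e.g. bartok + standoff -> bartok-standoff.xml."""
    return f"{document_name}-{annotation_type}.xml"


def select_for_saving(annotation_names, standoff_tags):
    """Pick the annotations to save from the chosen annotation set.

    An empty standoff_tags list means every annotation is saved.
    Matching is case sensitive.
    """
    if not standoff_tags:
        return list(annotation_names)
    wanted = set(standoff_tags)
    return [name for name in annotation_names if name in wanted]
```

Note that select_for_saving(["tok", "s"], ["S"]) returns an empty list, which corresponds to the "no standoff annotations were found" situation caused by a mismatch in case.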

Gate Tips

  1. Be sure to specify UTF-8 as the encoding type when opening the text files directly.
  2. When exiting GATE, be sure to select Exit Gate from the File menu. Otherwise, GATE will not save its current state information and will not restore open applications, documents, or processing resources the next time it is started up.
  3. Do not use files in a location with a space anywhere in the path, as GATE has problems with this format.
  4. If something does not work, check for error messages in GATE’s Message tab.
  5. If you get an error stating that no standoff annotations were found when running the SaveStandoff processing resource, the most likely causes are that you specified the name of the annotation set incorrectly, or you specified the standoff tags incorrectly. Both values are case sensitive.
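Tip 3 can be checked before opening a file. The sketch below uses a hypothetical helper name and is independent of GATE; it lists the path components that contain a space, so that they can be renamed or the files moved.

```python
import re


def components_with_spaces(path):
    """Return the components of a path (split on / or \\) that contain a space."""
    parts = re.split(r"[\\/]", path)
    return [part for part in parts if " " in part]


if __name__ == "__main__":
    offending = components_with_spaces("/data/my corpus/bartok.txt")
    if offending:
        print("Rename or move these before loading in GATE:", offending)
```

An empty result means the path contains no spaces and should be safe to use.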

© 2002-2015 American National Corpus Project. All rights reserved. Please contact us if you have any comments or questions.