About

TERRIER (Temporally Extended, Regular, Reproducible International Event Records) BETA is a new machine-coded event dataset produced from a historical corpus spanning 1979 to 2016, available for download at OSF. Event data consists of structured records of political events described in text, in the form of (1) a source actor (2) committing an action (3) against a target. The political events recorded in the dataset cover a wide range of political behaviors: meetings, statements, provision of aid, protests, attacks, and violence. This dataset is an initial beta release with only preliminary event geolocation. We encourage researchers to check the data they use carefully and to report any issues they uncover to our team by opening a thread on our discussion forum.

The dataset was produced by a team at the University of Oklahoma as part of the NSF RIDIR grant “Modernizing Political Event Data” SBE-SMA-1539302. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or the U.S. government.

Download the data from OSF.

Note: we are not affiliated with the Terrier information retrieval group at the University of Glasgow. Check out the excellent work they do on their homepage.

The Team

Jill Irvine, PhD, Presidential Professor of International and Area Studies, University of Oklahoma
Christan Grant, PhD, Assistant Professor of Computer Science, University of Oklahoma
Andrew Halterman, PhD Candidate, MIT
Khaled Jabr, MA Candidate, University of Oklahoma
Yan Liang, PhD Candidate, University of Oklahoma

Getting Started

Download the data from OSF.

What is event data?

Event data, at its most basic, consists of a “triple” of information: an event, such as a protest or attack, performed by a source actor against a target. These events and actors are automatically recognized in text, extracted, and resolved to a defined set of codes, such that “demonstrated” and “rallied in the streets” would both be coded as a “Protest” event and “Angela Merkel” and “German Ministry of Defense” would both be represented as DEU GOV. Performing this process on many millions of documents produces a set of structured data that is much easier to analyze than the raw documents.
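As an illustrative sketch of this resolution step, the toy dictionaries below map a few phrases to codes and assemble a triple. The phrase-to-code mappings here are hypothetical simplifications, not the dictionaries TERRIER actually uses:

```python
# Toy example of resolving text spans to a coded event triple.
# These mappings are illustrative only, not TERRIER's real dictionaries.
ACTOR_DICT = {
    "angela merkel": "DEU GOV",
    "german ministry of defense": "DEU GOV",
}
EVENT_DICT = {
    "demonstrated": "14",           # CAMEO root code 14: "Protest"
    "rallied in the streets": "14",
}

def code_event(source_phrase, verb_phrase, target_phrase):
    """Return a (source, event, target) triple of resolved codes."""
    return (
        ACTOR_DICT[source_phrase.lower()],
        EVENT_DICT[verb_phrase.lower()],
        ACTOR_DICT[target_phrase.lower()],
    )

event = code_event("Angela Merkel", "demonstrated",
                   "German Ministry of Defense")
print(event)  # ('DEU GOV', '14', 'DEU GOV')
```

Run over millions of documents, this kind of resolution yields records that can be counted, aggregated, and analyzed directly, without returning to the raw text.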

In producing event data, we build on the dominant paradigm of event coding in English, which consists of automatically comparing grammatically parsed sentence text with hand-defined dictionaries using an event coding tool. The tool follows instructions about how to combine the extracted noun and verb phrases into a complete event with a source and target, and resolves the extracted text to a specified set of codes defined in an ontology. An automated event coding system thus consists of two components: a set of dictionaries that map noun and verb phrases to their corresponding actor and event codes in an ontology, and an event coder that applies these dictionaries to the text and makes decisions about how to combine individual actors and actions into coded events.

The ontology we use is the CAMEO ontology, which is the current standard coding ontology for most event data. CAMEO enforces a requirement for a source actor and target actor to go along with each event. Actors and events are each assigned hierarchical codes. Actor codes begin with high-level information comprising the actor's country or international status, followed by a functional role code, such as “GOV” or “MIL”, with secondary codes providing greater detail in some cases. Event codes can be aggregated into five top-level classes, 20 intermediate event types, or around 200 low-level codes. Each code is documented in the codebook, available here.
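A minimal sketch of how these hierarchical codes decompose, assuming the usual CAMEO conventions that the first two digits of an event code give its intermediate (root) event type and that actor codes are built from three-character segments (country first, then role codes):

```python
def decompose_event_code(code):
    """Split a full CAMEO event code into (root_code, full_code).

    Assumes the standard convention that the first two digits give
    the intermediate (root) event type.
    """
    return code[:2], code

def decompose_actor_code(code):
    """Split a CAMEO actor code into (country, primary_role, other_roles).

    Assumes three-character segments, e.g. 'DEUGOVHLH' ->
    ('DEU', 'GOV', ['HLH']).
    """
    segments = [code[i:i + 3] for i in range(0, len(code), 3)]
    country, roles = segments[0], segments[1:]
    primary = roles[0] if roles else ""
    return country, primary, roles[1:]

print(decompose_event_code("1451"))       # ('14', '1451')
print(decompose_actor_code("DEUGOVHLH"))  # ('DEU', 'GOV', ['HLH'])
```

Consult the CAMEO codebook for the authoritative structure; the fixed-width segmentation above is a simplification that holds for typical codes.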

For more details on how we created the Terrier event dataset, see our technical overview.

Terrier Dataset

The Terrier dataset was generated from roughly 200 million news stories from 500 news sources around the world, spanning 1979 to early 2016. The raw text of each story was obtained from LexisNexis, first through special API access and later through bulk dumps provided by LexisNexis on mailed hard disks. From these news articles, we produced around 60 million events.

The number of events produced per month (below) shows a marked increase over time, as more source text becomes available and as dictionary coverage improves. This exponential increase is familiar from other event datasets and poses challenges for researchers making over-time claims. Importantly, however, it shows no major missing time periods.

The events in TERRIER have initial geolocation information attached to them. The geolocation process arbitrarily selects one of the locations that CLIFF-CLAVIN extracts from the sentence containing the event in question and attaches it to the event. As with many maps, a high-level plot of geolocated events reveals good coverage of the world’s population density.

Codebook

The data available on OSF comes in two formats: JSON and TSV. The JSON data includes field names for each entry; the TSV does not. These are the fields available for each event, presented in the order in which they occur within the TSV files. For more details on what the different codes represent, please consult the CAMEO manual.

  • “code” : The full CAMEO code of the coded event
  • “src_actor” : The highest level code (e.g. country) for the source actor
  • “month”: The two digit month code for the event (MM)
  • “tgt_agent” : The primary role code for the target actor (e.g. “GOV”, “MIL”)
  • “country_code” : The country the event is geolocated to (two-letter ISO code)
  • “year” : The YYYY date of the event/story
  • “mongo_id” : The ID of the story in our database (mostly for internal use)
  • “source” : The name of the news source that published the story.
  • “date8” : The date of the event/story in YYYYMMDD format
  • “src_agent” : The primary role code for the source actor (e.g. “GOV”, “MIL”)
  • “tgt_actor” : The highest level code (e.g. country) for the target actor
  • “latitude” : The latitude of the geolocated event
  • “src_other_agent” : Other, secondary role codes for the source actor (semicolon separated)
  • “quad_class” : The “quad” class of the event. 1 = verbal cooperation, 2 = verbal conflict, 3 = material cooperation, 4 = material conflict. (0 = neutral statement-type events)
  • “root_code” : The CAMEO event root code (one of twenty)
  • “tgt_other_agent” : Other, secondary role codes for the target actor (semicolon separated)
  • “day” : The day of the event/story in DD format
  • “target” : The full target actor code (primary code plus role code).
  • “goldstein” : a -10 to 10 conflictual-cooperative scale.
  • “geoname” : The place name the event was resolved to
  • “longitude” : The inferred longitude of the event
  • “url” : (Unused for LexisNexis-based stories)
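Assuming the fields appear in TSV rows in the order listed above (an assumption worth verifying against the files themselves), a row can be read into a keyed record like so:

```python
import csv
import io

# Field order as listed above; verify against the actual TSV files.
FIELDS = ["code", "src_actor", "month", "tgt_agent", "country_code",
          "year", "mongo_id", "source", "date8", "src_agent",
          "tgt_actor", "latitude", "src_other_agent", "quad_class",
          "root_code", "tgt_other_agent", "day", "target",
          "goldstein", "geoname", "longitude", "url"]

def read_events(tsv_text):
    """Yield one dict per event row, keyed by the field names above."""
    reader = csv.DictReader(io.StringIO(tsv_text),
                            fieldnames=FIELDS, delimiter="\t")
    yield from reader

# A fabricated example row (values are illustrative, not real data):
row = "\t".join(["145", "SYR", "03", "GOV", "SY", "2012", "abc123",
                 "AFP", "20120315", "OPP", "SYR", "33.5", "", "4",
                 "14", "", "15", "SYRGOV", "-6.5", "Homs", "36.7", ""])
event = next(read_events(row))
print(event["root_code"], event["goldstein"])  # 14 -6.5
```

The JSON files already carry field names, so a plain `json.loads` per record suffices there.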

Sources

The English language sources used in TERRIER include LexisNexis’ complete collection of articles published by the following sources between 1979 and early 2016. (Note that LexisNexis holds relatively few articles from these sources for the 1980s and 1990s.)

“Associated Press International”, “The Associated Press”, “BBC Monitoring”, “BBC SUMMARY WORLD BROADCAST”, “The New York Times”, “AFP”, “Xinhua”, “The Globe and Mail”, “ITAR-TASS”, “McClatchy Washington Bureau”, “AllAfrica”, “UPI”, “The Christian Science Monitor”, “European Press Agency (EPA)”, “South China Morning Post”, “AAP Newsfeed”, “The Guardian”, “The Straits Times (Singapore)”, “The Baltimore Sun”, “Today’s Zaman”, “The Sydney Morning Herald (Australia)”, “Jerusalem Post”, “Dawn (Pakistan)”, “Ghana News Agency”, “The Straits Times Singapore”, “Guardian.com”, “IPS - Inter Press Service”, “Belfast Telegraph”, “The Philadelphia Inquirer”, “Hindustan Times”, “Russia & CIS General Newswire”, “Russia & CIS Energy Newswire”, “Ukraine General Newswire”, “Russia & CIS Business and Financial Newswire”, “Kazakhstan General Newswire”, “Czech Republic Business Newswire”, “Russia & CIS Military Newswire”, “Poland Business Newswire”, “Hungary Business Newswire”, “Central Asia General Newswire”, “China Energy Newswire”, “China Mining and Metals Newswire”, “China Telecommunications Newswire”, “China Pharmaceuticals & Health Technologies Newswire”, “Philadelphia Inquirer (Pennsylvania)”, “Knight Ridder Washington Bureau”, “The Japan News”, “The Daily Yomiuri (Tokyo)”, “The Business Times Singapore”, “ABC Premium News (Australia)”, “FARS News Agency”, “The Press (Christchurch, New Zealand)”, “The Baluchistan Times”, “Detroit Free Press (Michigan)”, “TASS”, “CNN Wire”, “The Nation (AsiaNet)”, “DAILY MAIL (London)”, “MAIL ON SUNDAY (London)”, “The Nation”, “THE KOREA HERALD”, “Central Asia & Caucasus Business Weekly”, “The Japan Times”, “Facts on File World News Digest”, “Baltic News Service”, “Daily Mail (London)”, “Mail on Sunday (London)”, “WAM Emirates News Agency”

Technical Details

Producing event data requires passing text through a series of tools to grammatically parse sentences, recognize events, and record them in a structured format. This process depends on recognizing actors, events, and targets in text and comparing them to hand-built dictionaries to produce standardized actor and event codes. TERRIER was produced with several open source tools and tools produced and maintained by the Open Event Data Alliance.

Grammatical parsing

The first step in producing event data is to annotate the text with grammatical markup to provide information about the structure of sentences and the syntactic relationships between different parts of the sentence. This step automatically identifies noun and verb phrases and the relationships between them, making it easier to determine who the actors are and what events are occurring.

CoreNLP

To perform this step, we draw on the large body of work conducted in computational linguistics and natural language processing over the past two decades. Specifically, we use Stanford University’s CoreNLP to provide a constituency parse of each document.

Biryani

Because of the size of our corpus (~2TB, 300 million stories), running CoreNLP was not a trivial task. We developed a distributed task-queue tool for distributing CoreNLP jobs across a cluster of machines to speed processing.[1] Our tool, biryani, uses a Kalman filter to dynamically adjust the batch size and thread count during processing. More details are available in an article here.
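To illustrate the idea of Kalman-filtered batch tuning (this is a simplified sketch, not biryani’s actual code), a one-dimensional filter can track the estimated per-document processing time and pick the next batch size to hit a target batch duration:

```python
# Illustrative sketch (not biryani's actual implementation): a 1-D
# Kalman filter tracks seconds-per-document and adjusts batch size
# to hit a target batch duration.
class BatchSizeTuner:
    def __init__(self, target_seconds=30.0, est=0.1, var=1.0,
                 process_var=1e-4, measure_var=0.01):
        self.target = target_seconds
        self.est = est                  # estimated seconds per document
        self.var = var                  # variance of that estimate
        self.process_var = process_var  # drift between batches
        self.measure_var = measure_var  # noise in each timing

    def update(self, batch_size, batch_seconds):
        """Fold in one observed batch timing; return the next batch size."""
        measured = batch_seconds / batch_size   # observed sec/doc
        self.var += self.process_var            # predict step
        gain = self.var / (self.var + self.measure_var)
        self.est += gain * (measured - self.est)  # correct step
        self.var *= (1.0 - gain)
        return max(1, int(self.target / self.est))

tuner = BatchSizeTuner()
size = 100
for _ in range(3):
    # Simulate batches that take 0.6 seconds per document.
    size = tuner.update(size, size * 0.6)
print(size)  # 50  (target 30 s / ~0.6 s per doc)
```

The same machinery extends to thread count; the filter smooths noisy timings so the controller does not overreact to a single slow batch.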

Event Coding

The second major step in producing event data is to recognize political events in text: determining which words in the sentence correspond to actors, targets, and events, and which codes to assign to each actor or event.

Once CoreNLP generates grammatical information on the document, we are left with the task of determining which noun phrases correspond to our “source” and “target” actors, and which verb phrases could be events. In addition to finding these spans of text, we also want to resolve them to predefined categories to make them easily analyzable for social science research.

Petrarch2

The heart of our event data pipeline is Petrarch2, which locates actors, events, and targets in text, compares them to dictionaries that map short phrases to actor and event codes, and returns a complete event. Petrarch2 is a well-known workhorse in automated event data. It is available for download here and is described in a white paper here.

Birdcage

We produced a new pipeline to run Petrarch2 at scale over many millions of documents. Although Petrarch2 is quite fast, it is not natively parallel, and it also requires slower pre- and post-processing steps, including geolocation and final formatting. To bundle all of these steps together, we created Birdcage, a distributed pipeline that can quickly generate event data from CoreNLP-processed text.
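The shape of such a pipeline can be sketched as chained stages fanned out over a worker pool. This is an illustrative stand-in, not Birdcage’s actual code; each stage body is a dummy for the real step named in its docstring:

```python
# Illustrative sketch (not Birdcage's actual code): bundling the
# coding step with its pre- and post-processing stages and fanning
# documents out across workers. Threads are used here for simplicity;
# a real deployment would distribute across machines.
from concurrent.futures import ThreadPoolExecutor

def preprocess(doc):
    """Stand-in for extracting sentences from CoreNLP output."""
    return doc.strip()

def code_events(sentence):
    """Stand-in for the Petrarch2 coding step."""
    return {"text": sentence, "code": "14"}  # dummy coded event

def postprocess(event):
    """Stand-in for geolocation and final formatting."""
    event["geoname"] = None  # geolocation would be attached here
    return event

def run_pipeline(doc):
    return postprocess(code_events(preprocess(doc)))

docs = ["  Protesters rallied in the capital. "] * 4
with ThreadPoolExecutor(max_workers=2) as pool:
    events = list(pool.map(run_pipeline, docs))
print(len(events))  # 4
```

Keeping the slow pre- and post-processing stages in the same worker as the coder avoids shuttling intermediate results through a central bottleneck.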

The final dataset is available for download from OSF.


  1. Although a nice Spark wrapper for CoreNLP exists, we preferred our simpler distributed approach because of its portability across systems and our desire to not depend on maintaining a Spark cluster.

Publications

Yan Liang, Christan Grant, Andrew Halterman, Jill A. Irvine, Khaled Jabr. New Techniques for Coding Political Events Across Languages. IEEE 18th International Conference on Information Reuse and Integration. Salt Lake City, UT. 2018. [pdf] [slides]

Andrew Halterman, Jill A. Irvine, Christan Grant, Khaled Jabr, Yan Liang. Creating an Automated Event Data System for Arabic Text. Annual Meeting of the International Studies Association (ISA). San Francisco, CA. 2018. [pdf] [slides]

Andrew Halterman, Jill Irvine, Manar Landis, Phanindra Jalla, Yan Liang, Christan Grant, Mohiuddin Solaimani. Adaptive Scalable Pipelines for Political Event Data Generation. 2017 IEEE International Conference on Big Data (Big Data). [link] [paper] [slides]

Andrew Halterman. Mordecai: Full Text Geoparsing and Event Geocoding. The Journal of Open Source Software, Vol 2, No. 9 (2017).

Discuss