Cyc®-NL Documentation

E-Mail Comments to: opencyc-doc@cyc.com
Last Update: 3/28/2002
Copyright© 1996-2002 Cycorp. All rights reserved.


This is an introduction to Cyc's natural language processing system (Cyc-NL).


Introductory Information About Cyc-NL

Cyc-NL is the natural language processing system associated with the Cyc knowledge base. The OpenCyc NL release includes a lexicon, a morphology component and a generation system, each of which is described in more detail below. In future versions, several parsers will be included as well.

Our goals for the Cyc-NL system include both understanding and generation. In the first case, we want to translate natural language texts into CycL, Cyc's internal representation language (see The Syntax of CycL for more information). In the second case, we want to provide natural language translations for CycL expressions. Building upon these two capabilities, Cyc and the Cyc-NL system can be applied to a wide array of tasks, including document indexing and retrieval, database querying, machine translation, enhanced speech recognition, and so on.

Large-scale natural language processing (NLP) is a famously difficult undertaking, for many reasons. Syntactic parsing, lexical representation, semantic interpretation, pragmatic processing, and discourse management, among other tasks, would be required to produce a truly functional natural-language dialogue system. Researchers have made important strides in many of the above-mentioned areas, but artificial language technology available today does not come close to mimicking conversation between human beings. See, for example, Ron Cole, Joseph Mariani, Hans Uszkoreit, Giovanni Battista Varile, Annie Zaenen, Antonio Zampolli, and Victor Zue (eds.), "Survey of the State of the Art in Human Language Technology" (1997), for a discussion of the challenges presented by various aspects of natural language processing, and the level of performance that existing NL systems can achieve.

We believe that any application relying on large-scale natural language processing would benefit from a broad and deep repository of commonsense knowledge such as Cyc. Here are a number of representative linguistic problems which NL systems have been called upon to solve in the past few decades, and for which, we claim, an adequate solution requires an NL system to work in tandem with a knowledge base like Cyc:

Lexical Ambiguity and Polysemy
The majority of the most common words in English have multiple meanings: "bat", "bank", "table", "can", "will", etc. In applications such as machine translation and document indexing/retrieval, it is crucial to be able to figure out which meaning of an ambiguous word is intended. For example, if a user queries a text database for "bats and other small mammals", a standard Boolean search engine will also deliver documents about baseball bats, even though it is obvious to any reasonable human that this is not what the querier intended. The background knowledge in Cyc can be used in attempting to choose the most appropriate meaning of a word in context.

Syntactic Ambiguity
Syntactic ambiguity occurs when an input string has more than one possible syntactic structure. For example, in the sentence "Fred brought German beer and potato chips to the party", syntactic information alone cannot tell us whether both the potato chips and the beer were German, or whether just the beer was German. As with lexical ambiguity, NL applications need to be able to select the intended structure. Another pervasive type of syntactic ambiguity is prepositional phrase attachment. In the sentence "John washed the dishes on the table", the syntax alone allows two possibilities: John washed the dishes which are now on the table, or John did the dishwashing on the table. A system with access to a repository of commonsense knowledge like Cyc would have an advantage over non-knowledge-based systems in ruling out the second interpretation in a principled manner. Cyc knows that dishwashing typically occurs in sinks or dishwashers, not on tables, and this information could be used to select the appropriate syntactic structure underlying the sentence.

Coreference Resolution
In interpreting text, new pieces of information must be integrated with what has been mentioned before. A pronoun may be used to refer back to something already in the universe of discourse. For example, in "Fred went to the store, bought a candy bar with his credit card, walked to the park, and ate it", the pronoun "it" refers back to "candy bar". Syntactically, the potential referents of "it" are the nouns "store", "candy bar", "credit card", and "park". Cyc has rules from which it can infer that parks, stores, and credit cards aren't things which are normally eaten, and can thus help in determining that "candy bar" is the only referent for "it" which makes sense in this example.
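
As a rough illustration of this kind of filtering, here is a minimal Python sketch. The `ISA` table and the `EDIBLE_TYPES` set are hypothetical stand-ins for knowledge that Cyc would infer from its own rules; the type names are not actual Cyc constants, and nothing here reflects Cyc's real interface.

```python
# Toy stand-in for KB knowledge: each candidate noun mapped to a type.
# The type names are illustrative, not actual Cyc constants.
ISA = {
    "store": "Place",
    "candy bar": "Food",
    "credit card": "Artifact",
    "park": "Place",
}

# Types whose instances are normally eaten (an assumption of this sketch).
EDIBLE_TYPES = {"Food"}


def plausible_referents(candidates, required_types):
    """Keep only the candidates whose type satisfies the verb's constraint."""
    return [c for c in candidates if ISA.get(c) in required_types]


candidates = ["store", "candy bar", "credit card", "park"]
print(plausible_referents(candidates, EDIBLE_TYPES))
```

The syntactic candidates are kept as a plain list, and the semantic constraint contributed by "ate" is applied as a filter afterwards, mirroring the division of labor described above.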

The Cyc Lexicon

The Cyc lexicon is a subset of the KB which contains all the information about particular words needed by the parser to translate English expressions into CycL. This lexical information is also used by the English generation system to generate English text from CycL expressions. Currently, lexical information is only available for English. However, we have striven to make the Cyc ontology as language-neutral as possible, so that, in the future, lexicons for multiple languages can be integrated.

The basic kinds of information represented in the lexicon are syntactic information and semantic information about words.

Representing Words In The Lexicon: Syntactic Information
Information in the Cyc lexicon centers around "word units", which are actually instances of the collection #$EnglishWord. Each word unit represents information about a root word in English. That information is organized by part of speech and word sense. For example, the word "bat" is represented by the Cyc constant #$Bat-TheWord. "Bat" can be a noun; as a noun, it has (at least) three word senses. "Bat" can also be a verb; as a verb, it has (at least) three word senses.

Syntactically, the crucial information about a word includes its part of speech, and the features it carries (plural, past tense, etc.). This information is recorded using predicates which are instances of the collection #$NLSyntacticPredicate. For each of the main parts of speech, there is a set of corresponding predicates. Each predicate within a set indicates a distinctive feature for members of that part-of-speech (pos) category. For example, the predicates #$singular and #$plural are used to represent information about nouns. The former indicates a singular form, and the latter indicates a plural form. Predicates like #$singular and #$plural relate a Cyc word unit to a character string. From these assertions, forward inference derives assertions of the form (#$posForms [word-unit] [pos]), where [pos] is an instance of the collection #$SpeechPart. #$SpeechPart contains all parts of speech recognized by the Cyc-NL parser. Example uses of some of the syntactic predicates, and the #$posForms statements which they trigger, follow.

Count Nouns

[NOTE: #$SimpleNoun will be renamed as #$CountNoun in (or before) OpenCyc v1.0.]

(#$singular #$Dog-TheWord "dog")
(#$plural #$Child-TheWord "children")
(#$posForms #$Dog-TheWord #$SimpleNoun)
(#$posForms #$Child-TheWord #$SimpleNoun)

Mass Nouns

(#$massNumber #$Sand-TheWord "sand")
(#$posForms #$Sand-TheWord #$MassNoun)

Proper Nouns

(#$pnSingular #$Honda-TheWord "Honda")
(#$pnPlural #$Welshman-TheWord "Welshmen")
(#$pnMassNumber #$Valium-TheWord "Valium")
(#$posForms #$Honda-TheWord #$ProperNoun)
(#$posForms #$Welshman-TheWord #$ProperNoun)
(#$posForms #$Valium-TheWord #$ProperNoun)

Agentive Nouns

(#$agentive-Sg #$Type-TheWord "typist")
(#$agentive-Pl #$Fish-TheWord "fishermen")
(#$agentive-Mass #$Cleanse-TheWord "cleanser")
(#$posForms #$Type-TheWord #$AgentiveNoun)
(#$posForms #$Fish-TheWord #$AgentiveNoun)
(#$posForms #$Cleanse-TheWord #$AgentiveNoun)

Adjectives

(#$regularDegree #$Red-TheWord "red")
(#$comparativeDegree #$Red-TheWord "redder")
(#$superlativeDegree #$Red-TheWord "reddest")
(#$nonGradableAdjectiveForm #$Main-TheWord "main")
(#$posForms #$Red-TheWord #$Adjective)
(#$posForms #$Main-TheWord #$NonGradableAdjective)

Adverbs

(#$regularAdverb #$Happy-TheWord "happily")
(#$comparativeAdverb #$Good-TheWord "better")
(#$superlativeAdverb #$Good-TheWord "best")
(#$posForms #$Happy-TheWord #$Adverb)
(#$posForms #$Good-TheWord #$Adverb)

Verbs

(#$infinitive #$Eat-TheWord "eat")
(#$thirdPersonSg #$Eat-TheWord "eats")
(#$gerund #$Eat-TheWord "eating")
(#$pastTense #$Eat-TheWord "ate")
(#$perfect #$Eat-TheWord "eaten")
(#$posForms #$Eat-TheWord #$Verb)
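
The way the syntactic predicates above trigger #$posForms statements can be sketched in Python. This is only an illustration: in Cyc the derivation is done by forward inference over rules in the KB, whereas the lookup table `PREDICATE_TO_POS` and the function `derive_pos_forms` below are hypothetical names built directly from the examples in this section.

```python
# Each syntactic predicate from the examples above, paired with the part of
# speech that its #$posForms statement mentions (#$ prefixes dropped).
PREDICATE_TO_POS = {
    "singular": "SimpleNoun",
    "plural": "SimpleNoun",
    "massNumber": "MassNoun",
    "pnSingular": "ProperNoun",
    "pnPlural": "ProperNoun",
    "pnMassNumber": "ProperNoun",
    "agentive-Sg": "AgentiveNoun",
    "agentive-Pl": "AgentiveNoun",
    "agentive-Mass": "AgentiveNoun",
    "regularDegree": "Adjective",
    "comparativeDegree": "Adjective",
    "superlativeDegree": "Adjective",
    "nonGradableAdjectiveForm": "NonGradableAdjective",
    "regularAdverb": "Adverb",
    "comparativeAdverb": "Adverb",
    "superlativeAdverb": "Adverb",
    "infinitive": "Verb",
    "thirdPersonSg": "Verb",
    "gerund": "Verb",
    "pastTense": "Verb",
    "perfect": "Verb",
}


def derive_pos_forms(assertions):
    """Return the set of (posForms, word unit, pos) facts implied by the input."""
    derived = set()
    for predicate, word_unit, _string in assertions:
        pos = PREDICATE_TO_POS.get(predicate)
        if pos is not None:
            derived.add(("posForms", word_unit, pos))
    return derived


lexicon = [
    ("singular", "Dog-TheWord", "dog"),
    ("plural", "Child-TheWord", "children"),
    ("massNumber", "Sand-TheWord", "sand"),
    ("pastTense", "Eat-TheWord", "ate"),
    ("perfect", "Eat-TheWord", "eaten"),
]
for fact in sorted(derive_pos_forms(lexicon)):
    print(fact)
```

Because the result is a set, several inflected forms of the same word (here "ate" and "eaten") yield a single posForms fact.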

To relate a word unit to a string where the part of speech is not one of the main content classes outlined above, the predicate #$partOfSpeech is used. Some example uses of this predicate follow:

(#$partOfSpeech #$From-TheWord #$Preposition "from")
(#$partOfSpeech #$Since-TheWord #$SubordinatingConjunction "since")
(#$partOfSpeech #$Myself-TheWord #$ReflexivePronoun "myself")

Proper names can be represented in different ways in Cyc. #$ProperCountNoun and #$ProperMassNoun are useful for names that act like words (in other words, names that can undergo inflection such as pluralization, for example "Frenchman" or "Christmas"). In other cases, where names are generally inflexible in form, instances of #$ProperNamePredicate-General should be used. Specialized proper name predicates are available for names of people, places, and other categories. Here are some examples of the uses of these predicates:

(#$givenNames #$FredSmith "Frederick")
(#$familyName #$FredSmith "Smith")
(#$nicknames #$FredSmith "Freddie")
(#$acronymString #$UNICEF "UNICEF")
(#$initialismString #$YMCA-Organization "YMCA")
(#$placeName-LongForm #$Cambodia "the People's Republic of Kampuchea")

Representing Words In The Lexicon: Semantic Information
The semantic information contained in the Cyc lexicon forms the core of our NL system. The power of Cyc-NL comes from the links between word units and Cyc concepts. Cyc provides a clean semantics onto which the words of the Cyc lexicon can be mapped.

The most basic word-to-concept link is expressed with the predicate #$denotation. Here are a few example #$denotation assertions:

(#$denotation #$Bat-TheWord #$SimpleNoun 0 #$Bat-Mammal)
(#$denotation #$Bat-TheWord #$SimpleNoun 1 #$BaseballBat)
(#$denotation #$Bat-TheWord #$Verb 0 #$BaseballBatting)
(#$denotation #$Red-TheWord #$Adjective 0 #$RedColor)
(#$denotation #$Walk-TheWord #$Verb 0 #$AnimalWalkingProcess)

The first argument to #$denotation is an instance of #$LexicalWord. The second argument is an instance of #$SpeechPart. The third argument is an integer representing the word sense number, and the fourth argument is a Cyc constant. The first assertion above, for example, means that word sense number 0 of the noun form of "bat" denotes the Cyc concept #$Bat-Mammal. "Bat" as a noun has another word sense, which denotes #$BaseballBat. It should be noted here that the word sense numbers do not indicate frequency of occurrence of a particular word sense; they simply act as unique identifiers of word senses.
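
The four-argument structure of #$denotation can be pictured as a lookup table keyed on word unit and part of speech. The Python sketch below is illustrative only; `build_index` and `senses` are hypothetical helper names, and the data is copied from the example assertions above with the #$ prefix dropped.

```python
from collections import defaultdict

# (word unit, part of speech, sense number, concept)
DENOTATIONS = [
    ("Bat-TheWord", "SimpleNoun", 0, "Bat-Mammal"),
    ("Bat-TheWord", "SimpleNoun", 1, "BaseballBat"),
    ("Bat-TheWord", "Verb", 0, "BaseballBatting"),
    ("Red-TheWord", "Adjective", 0, "RedColor"),
    ("Walk-TheWord", "Verb", 0, "AnimalWalkingProcess"),
]


def build_index(denotations):
    """Group concepts by (word unit, part of speech), keyed by sense number."""
    index = defaultdict(dict)
    for word, pos, sense, concept in denotations:
        index[(word, pos)][sense] = concept
    return index


def senses(index, word, pos):
    """Return {sense number: concept} for a word used as a given part of speech."""
    return dict(index.get((word, pos), {}))


INDEX = build_index(DENOTATIONS)
print(senses(INDEX, "Bat-TheWord", "SimpleNoun"))
print(senses(INDEX, "Bat-TheWord", "Verb"))
```

The sense numbers serve purely as dictionary keys here, which matches their role as identifiers rather than frequency ranks.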

Another useful semantic predicate is #$denotationRelatedTo. This is used in cases where an exact mapping between a word and a Cyc concept is not available, but where you want to indicate that there is some relation between the two. For example, "shuffling", "perambulating", and "striding" are all types of walking, but currently, there are not distinct Cyc concepts representing each of these forms. In this case, then, we could assert

(#$denotationRelatedTo #$Shuffle-TheWord #$Verb 0 #$AnimalWalkingProcess)
(#$denotationRelatedTo #$Perambulate-TheWord #$Verb 0 #$AnimalWalkingProcess)
(#$denotationRelatedTo #$Stride-TheWord #$Verb 0 #$AnimalWalkingProcess)

Besides the #$denotation links, the Cyc lexicon also provides more precise mappings from words and phrases to Cyc concepts.

For many nouns, the #$denotation link is all that is needed to specify the meaning of a word in CycL. In some cases, though, the meaning of a noun is a more complicated Cyc formula. Here, we use the predicate #$nounSemTrans to express that relation. Here are a couple of examples:

(#$nounSemTrans #$Bachelor-TheWord 0 (#$and (#$isa :NOUN #$AdultMalePerson) (#$maritalStatus :NOUN #$Single)))

(#$nounSemTrans #$Barmaid-TheWord 0 (#$and (#$isa :NOUN #$Bartender) (#$isa :NOUN #$FemalePerson)))

The first rule states that the word "bachelor" can be used to refer to an unmarried adult male. The second rule states that the word "barmaid" can be used to refer to a female bartender.

For verbs, adjectives, and adverbs, simple denotation rules are provided, but more precise mapping templates are required as well. Verbs act as the "glue" in a sentence; verbs display different argument patterns, and it is important to specify how these argument positions are filled in. Look at these examples:

John bores Fred.
John likes Fred.
John gave Fred a car.
John wanted Fred to leave.

In the first sentence, we must capture the fact that it is the direct object, Fred, who is experiencing the boredom. The second sentence looks similar on the surface, but here it is the subject, John, who is experiencing the liking. The last two sentences demonstrate verb-argument patterns other than the simple transitive structure seen in the first two sentences.

Mappings for verbs are given by assertions using the predicate #$verbSemTrans. Here are a few examples:

(#$verbSemTrans #$Eat-TheWord 0 #$TransitiveNPCompFrame (#$and (#$isa :ACTION #$EatingEvent) (#$doneBy :ACTION :SUBJECT) (#$inputsDestroyed :ACTION :OBJECT)))

(#$verbSemTrans #$Feed-TheWord 0 #$DitransitiveNPCompFrame (#$and (#$isa :ACTION #$FeedingEvent) (#$fromPossessor :ACTION :SUBJECT) (#$objectOfPossessionTransfer :ACTION :OBJECT) (#$toPossessor :ACTION :INDIRECT-OBJECT)))

(#$verbSemTrans #$Like-TheWord 0 #$TransitiveNPCompFrame (#$likesObject :SUBJECT :OBJECT))

The first rule provides a template for interpreting "eat" when used transitively. The rule gives a CycL translation, with keyword variables as "placeholders" for arguments which will fill them. In this rule, slots are reserved for the syntactic subject and direct object. The second rule gives a translation for "feed" when used ditransitively, as in "I fed the horse an apple." In addition to subject and object slots, this rule also provides for an indirect object slot. The third rule demonstrates that verbs may refer to constants which are not #$Events. Many verbs have "stative" meanings; they denote not an action, but a state of affairs holding. In these cases, typically, the verb will translate into a Cyc predicate. In this case, transitive "like" maps onto the predicate #$likesObject. Each of these rules mentions the appropriate #$SubcategorizationFrame in which the given translation holds.
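
To make the role of the keyword placeholders concrete, the following Python sketch fills in the template for transitive "eat". It assumes a representation of CycL formulas as nested tuples, and the function `instantiate`, the variable name `?EATING`, and the term `CandyBar-1` are all hypothetical; the source does not describe how the parser performs this substitution internally.

```python
# Template for transitive "eat", transcribed from the #$verbSemTrans example
# above as nested tuples (the #$ prefix is dropped).
EAT_TRANSITIVE = (
    "and",
    ("isa", ":ACTION", "EatingEvent"),
    ("doneBy", ":ACTION", ":SUBJECT"),
    ("inputsDestroyed", ":ACTION", ":OBJECT"),
)


def instantiate(template, bindings):
    """Recursively replace keyword placeholders with their bound values."""
    if isinstance(template, tuple):
        return tuple(instantiate(part, bindings) for part in template)
    return bindings.get(template, template)


# Hypothetical bindings for a sentence such as "Fred ate the candy bar".
bindings = {":ACTION": "?EATING", ":SUBJECT": "Fred", ":OBJECT": "CandyBar-1"}
print(instantiate(EAT_TRANSITIVE, bindings))
```

Terms that are not keys in `bindings`, such as predicate names and collections, pass through unchanged, so the same function would work for the stative "like" template as well.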

Finally, Cyc-NL has mechanisms for handling phrasal structures. See #$multiWordString and #$compoundString for information on how to handle multi-word terms like "swimming pool" or "attorney general".

The Morphology Component

The Cyc-NL morphology component includes a set of SubL functions which handle the recognition and generation of regular inflectional forms of root words. This greatly reduces the number of assertions we need in the Cyc lexicon.

For nouns, the plural is the only inflectional variant. A noun is assumed to have a regular plural variant if the plural is formed by adding "-s" or "-es" to the singular. So, "lunch" and "dog" have regular plurals, while "child" and "deer" do not.

For verbs, the infinitive is the root form, while the inflected forms are the gerund, the third person singular, the past tense, and the perfect. The gerund form is regular if it is formed by adding "-ing" to the infinitive (whether or not a consonant is doubled; so, both "singing" and "mapping" are considered regular forms). The third person singular form is regular if it is formed by adding "-s" or "-es" to the infinitive. The past tense and perfect forms are regular if they are formed by adding either "-d", "-ied", or "-ed" to the infinitive. So, "baked", "hurried", and "washed" are regular, but "eaten" and "went" are not.

Regularly inflected verb and noun forms need not be entered in the lexicon. Therefore, all uses of #$plural, #$gerund, #$pastTense, #$thirdPersonSg, and #$perfect should involve irregular forms.
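
The regularity criteria above can be written down as simple string checks. The Python sketch below is not the SubL implementation, which is not shown in this document; the function names are hypothetical, and treating "-ied" as replacing a final "y" is an assumption made so that "hurried" comes out regular, as the text says it should.

```python
def is_regular_plural(singular, plural):
    """A plural is regular if it is the singular plus "-s" or "-es"."""
    return plural in (singular + "s", singular + "es")


def is_regular_gerund(infinitive, gerund):
    """Regular gerunds add "-ing", with or without doubling the final letter."""
    doubled = infinitive + infinitive[-1] + "ing"
    return gerund in (infinitive + "ing", doubled)


def is_regular_past(infinitive, form):
    """Regular past and perfect forms add "-d" or "-ed", or turn "-y" into "-ied"."""
    candidates = {infinitive + "d", infinitive + "ed"}
    if infinitive.endswith("y"):
        candidates.add(infinitive[:-1] + "ied")
    return form in candidates


# Only forms that fail these checks would need their own lexicon assertion.
for root, form in [("lunch", "lunches"), ("dog", "dogs"), ("child", "children"), ("deer", "deer")]:
    print(root, form, is_regular_plural(root, form))
for root, form in [("bake", "baked"), ("hurry", "hurried"), ("wash", "washed"), ("eat", "eaten"), ("go", "went")]:
    print(root, form, is_regular_past(root, form))
```

These checks only test whether a given pair of strings fits a regular pattern; choosing the right pattern when generating a form (for example "-es" rather than "-s" after "ch") would need additional spelling rules not covered here.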

For derivational morphological variants, instances of #$DerivedWordFormingFunction can be used to create new lexical items. For example, a lexical entry for "unhappy" can be composed as follows:

(#$WordWithPrefixFn #$Un_Neg-ThePrefix #$Happy-TheWord)

If the appropriate rules are stated using #$baseForm, #$posBaseForm, and #$generalSemantics, important syntactic and semantic information will be inferred about the derived word.

It should be noted that these derivational morphology functions are quite new, and have not yet been fully implemented. Therefore, you will find word units of both forms in the lexicon: compositional word units like

(#$WordWithPrefixFn #$Un_Neg-ThePrefix #$Happy-TheWord)

as well as atomic word units like this:

#$Unconscious-TheWord

The Parsers

Parsing in Cyc relies on a number of different components. Some existing Cyc-based applications rely on a hybrid top-down/bottom-up parser which allows for a balance between speed and flexibility.

The Text Processor guides the NLU process. It controls the application of the various parsing subcomponents, using a heuristic best-first search mechanism that has information about the individual parsers, their applicability to coarse syntactic categories, cost, expected number of children, and so on. This information is used to perform a syntax-driven search over the parse space, applying relevant parsers to the sub-constituents until all are resolved, or until the parsing options have been exhausted. The parsers at the disposal of the Text Processor are the Template parser, the Noun Compound parser, and the Phrase Structure parser.
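
One ingredient of such a search, ordering the applicable parsers by cost, can be sketched in Python with a priority queue. This is a greatly simplified illustration and not the Text Processor itself: it handles a single constituent and does not recurse into sub-constituents, and the registry `PARSERS`, its categories and costs, and the function names are all invented for the example.

```python
import heapq

# Hypothetical registry: parser name -> (coarse categories it handles, cost).
# The categories and costs are invented for illustration.
PARSERS = {
    "template": ({"S"}, 1),
    "noun-compound": ({"NP"}, 2),
    "phrase-structure": ({"S", "NP", "VP"}, 5),
}


def parse_cheapest_first(text, category, try_parser):
    """Try each applicable parser in order of increasing cost.

    try_parser(name, text) returns a parse, or None on failure.
    Returns (parser name, parse), or None once every option is exhausted.
    """
    frontier = [
        (cost, name)
        for name, (categories, cost) in PARSERS.items()
        if category in categories
    ]
    heapq.heapify(frontier)
    while frontier:
        _cost, name = heapq.heappop(frontier)
        parse = try_parser(name, text)
        if parse is not None:
            return name, parse
    return None


attempts = []


def toy_try_parser(name, text):
    """Stand-in parser: only the phrase-structure parser succeeds."""
    attempts.append(name)
    return ("parsed", text) if name == "phrase-structure" else None


result = parse_cheapest_first("John washed the dishes", "S", toy_try_parser)
print(attempts)
print(result)
```

In the toy run, the cheaper template parser is attempted first and fails, after which the more expensive phrase-structure parser is tried; the noun-compound parser is never attempted because it does not apply to the category "S".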

The Template parser is essentially a string-matching mechanism driven by a set of templates compiled into an efficient internal format. These templates, like those used for generation, employ a simple format so that users can add templates as they are entering new knowledge into the system. The template parser is relatively fast, but is of limited flexibility. It tabulates semantic constraints during a parse, but does not attempt to verify them; that task is passed along to the next processing layer.
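
The general idea of template-driven string matching can be illustrated in Python with regular expressions. The actual template format and its compiled internal representation are not described in this document, so the pattern syntax, the `TEMPLATES` list, and the function `template_parse` below are assumptions made purely for the sketch.

```python
import re

# Hypothetical templates: a precompiled surface pattern paired with a
# CycL-like output formula whose keywords name the pattern's groups.
TEMPLATES = [
    (
        re.compile(r"^(?P<SUBJECT>\w+) likes (?P<OBJECT>\w+)\.?$"),
        ("likesObject", ":SUBJECT", ":OBJECT"),
    ),
]


def template_parse(sentence):
    """Return the filled-in formula of the first matching template, else None."""
    for pattern, formula in TEMPLATES:
        match = pattern.match(sentence)
        if match:
            slots = {":" + key: value for key, value in match.groupdict().items()}
            return tuple(slots.get(term, term) for term in formula)
    return None


print(template_parse("John likes Fred."))
print(template_parse("John bores Fred."))
```

The sketch shows both sides of the trade-off mentioned above: matching is a fast scan over precompiled patterns, but a sentence that no template anticipates, such as the second one, simply fails to parse.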

The Noun Compound parser uses a set of semantic templates combined with a generic chart-parsing approach to construct representations for noun compounds such as "anthrax vaccine stockpile". Unlike other parsing components, it makes heavy use of the Cyc ontology, and can therefore resolve many ambiguities that are impossible to handle on a purely syntactic level (e.g. "Mozart symphonies" vs. "Mozart expert"). See #$NounCompoundRule for examples of rules used by the Noun Compound parser.

The Phrase Structure parser takes a similar bottom-up approach to constructing parses. After completing a syntactic parse, it uses semantic constraints gleaned from the KB to perform pruning and to build the semantic representation. Specialized sub-parsers are used to parse noun phrases and verb phrases; resulting constituent parses are combined to produce a complete semantic translation.

NL Generation

The natural language generation system produces a word-, phrase-, or sentence-level paraphrase of KB concepts, rules, and queries. The NLG system relies on information contained in the lexicon, and is driven by generation templates stored in the KB. These templates are not solely string-based; they contain linguistic data which allows, for example, for correct grammatical agreement to be generated. The NLG system is capable of providing two levels of paraphrase, depending on the demands of the application. One type of generated text is terse but potentially ambiguous, and the other is precise but potentially wordy and stilted. Through the Dictionary Assistant Tool, users can add generation templates for new terms as soon as they are introduced into the knowledge base. See #$genTemplate for examples of NLG templates.
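
The point that generation templates carry linguistic data, rather than fixed strings, can be illustrated with a small Python sketch of subject-verb agreement. The template layout in `GEN_TEMPLATES`, the toy `TERMS` lexicon, and the function names are hypothetical and are not the format used by #$genTemplate.

```python
# Hypothetical generation template for likesObject: argument 1, then a verb
# that must agree with argument 1, then argument 2.
GEN_TEMPLATES = {
    "likesObject": [("arg", 1), ("verb", "like", 1), ("arg", 2)],
}

# Toy lexicon: an English string and a grammatical number for each term.
TERMS = {
    "Fred": ("Fred", "singular"),
    "Children": ("the children", "plural"),
    "Dogs": ("dogs", "plural"),
}


def inflect(verb, number):
    """Present tense of a regular verb: add "-s" for a singular subject."""
    return verb + "s" if number == "singular" else verb


def generate(formula):
    """Render a (predicate, arg1, arg2, ...) tuple as an English phrase."""
    predicate, *args = formula
    words = []
    for item in GEN_TEMPLATES[predicate]:
        if item[0] == "arg":
            words.append(TERMS[args[item[1] - 1]][0])
        else:
            _kind, verb, agree_with = item
            words.append(inflect(verb, TERMS[args[agree_with - 1]][1]))
    return " ".join(words)


print(generate(("likesObject", "Fred", "Dogs")))
print(generate(("likesObject", "Children", "Fred")))
```

A purely string-based template such as "X likes Y" would produce "the children likes Fred"; recording which argument the verb agrees with is what lets the same template yield the correct form in both cases.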

The Cyc NLG system allows users to view any rule or statement in Cyc as an English sentence instead of a CycL formula. In order to see the English version of a CycL statement, click on the ball next to the rule, and then, on the next page, click [Show English]. You can also choose to see all assertions in English as the default, by going to the "Options" menu and selecting "Show assertions in English".

