Bootstrap Dialog System Status 2008-04-25
Note that the system can currently understand and generate only a single utterance. Progress towards broader coverage of English awaits the completion of the Grammar Acquisition Skill and the Vocabulary Acquisition Skill.
I have completed coding the English Generation Skill and tested it on my use case utterance “the book is on the table”. As expected, writing the supporting Generation Rule Application Library took less effort than its counterpart on the parsing side, the Parsing Rule Application Library. Both libraries use the same rule set, and the method of rule application is similar. In my opinion, this is strong validation of the Fluid Construction Grammar engine that I adopted from Luc Steels’s work.
As noted in my page on the English Comprehension Skill, my use case utterance “the book is on the table” has been elaborated with ambiguities to challenge the parser to figure out the right interpretation within the discourse context. Without much guidance from the literature on Natural Language Generation (NLG), I hypothesized that a cognitively plausible generation system should prune generation alternatives at the earliest possible point, and that the beam width of the generation interpretation tree should be 1. I selected the following features to score the generation alternatives. In the future I hope to employ machine learning to optimise the weighting of these features.
Generated utterances are favored:
- that have fewer words
- that reuse previously uttered words for a given meaning term
- that use words that the recipient is otherwise likely to know
- that have the least effort when performing a trial parse of the (partial) utterance
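The four features above can be combined into a single score. What follows is a minimal sketch, not the Texai implementation: the class, field names, sign conventions, and the use of equal weights of 1.0 are all assumptions made for illustration. It shows how a beam width of 1 reduces to keeping only the best-scoring alternative at each step.

```python
from dataclasses import dataclass

# Hypothetical weights, one per feature; all equal in this sketch.
WEIGHTS = {"word_count": 1.0, "reused_words": 1.0,
           "unfamiliar_words": 1.0, "parse_effort": 1.0}

@dataclass
class Alternative:
    text: str
    reused_words: int      # words already uttered for the same meaning term
    unfamiliar_words: int  # words the recipient is unlikely to know
    parse_effort: float    # cost of a trial parse, in arbitrary units

def score(alt: Alternative) -> float:
    """Higher is better: reward reuse, penalize length, unfamiliarity, effort."""
    return (WEIGHTS["reused_words"] * alt.reused_words
            - WEIGHTS["word_count"] * len(alt.text.split())
            - WEIGHTS["unfamiliar_words"] * alt.unfamiliar_words
            - WEIGHTS["parse_effort"] * alt.parse_effort)

def prune(alternatives: list[Alternative]) -> Alternative:
    """Beam width 1: keep only the single best-scoring alternative."""
    return max(alternatives, key=score)
```

The feature values would come from the generator and the trial parser; here they are simply supplied by the caller.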
Presently I am satisfied with these features, while conceding that performing numerous trial parses during generation degrades the response time. For example, in the use case utterance, I included the following ambiguities for a total of six possible generation alternatives:
- book, tome and volume were treated as synonyms for the meaning cyc:BookCopy (i.e. a book)
- on and on top of were treated as equivalent constructions for the meaning texai:OnTopOf-SituationLocalized (i.e. the situation having something on top of something else)
- book means either cyc:BookCopy or texai:SheetsOfPaperBoundTogetherOnOneEdge, in which the latter term can be used for a pad of paper or a matchbook
These are the six alternative generated utterances:
(1) the book is on the table
(2) the book is on top of the table
(3) the tome is on the table
(4) the tome is on top of the table
(5) the volume is on the table
(6) the volume is on top of the table
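The six alternatives are the cross product of three noun choices and two preposition choices. As an illustrative sketch only (the real system derives them by applying grammar rules, not string templates), they can be enumerated like this; the constant and function names are hypothetical.

```python
from itertools import product

# Surface forms taken from the use case described above.
NOUNS = ["book", "tome", "volume"]   # synonyms for cyc:BookCopy
PREPOSITIONS = ["on", "on top of"]   # texai:OnTopOf-SituationLocalized

def alternatives() -> list[str]:
    """Enumerate every noun/preposition combination as a full utterance."""
    return [f"the {noun} is {prep} the table"
            for noun, prep in product(NOUNS, PREPOSITIONS)]

for utterance in alternatives():
    print(utterance)
```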
Alternatives (2), (4), and (6) are disfavored because they are longer. Alternative (1) is disfavored because its trial parse requires more effort to figure out which word sense of book is meant, given that the discourse context contains a reference to a cyc:BookCopy. The preferred utterance was therefore a tie between (3) and (5), and the system generated choice (5). Because a human would not make the same choice, I believe that book and volume are not truly synonymous, even though WordNet groups them together. I’ll revisit this use case in the Vocabulary Acquisition Skill, where the user will have the opportunity to elaborate the semantics of these words so that they can be generated more appropriately.
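The outcome above can be reproduced with a toy cost function. This is a sketch under stated assumptions: trial-parse effort is approximated by counting the extra word senses of each word, both features carry a weight of 1.0, and the sense table holds only the senses mentioned in this use case. None of these names come from the Texai code.

```python
# Word senses limited to those named in the use case (an assumption).
SENSES = {
    "book": ["cyc:BookCopy", "texai:SheetsOfPaperBoundTogetherOnOneEdge"],
    "tome": ["cyc:BookCopy"],
    "volume": ["cyc:BookCopy"],
}

def cost(utterance: str) -> int:
    """Lower is better: word count plus one unit per extra word sense."""
    words = utterance.split()
    ambiguity = sum(len(SENSES.get(word, [word])) - 1 for word in words)
    return len(words) + ambiguity

def preferred(utterances: list[str]) -> list[str]:
    """Return every utterance that ties for the lowest cost."""
    lowest = min(cost(u) for u in utterances)
    return [u for u in utterances if cost(u) == lowest]

candidates = [
    "the book is on the table",
    "the book is on top of the table",
    "the tome is on the table",
    "the tome is on top of the table",
    "the volume is on the table",
    "the volume is on top of the table",
]
print(preferred(candidates))
```

With these assumptions the two survivors are the tome and volume variants with "on", which matches the tie described above. Nothing in this sketch breaks the tie.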
The next task ahead for the Texai project is to write the Vocabulary Acquisition Skill. Unlike the most recently completed tasks, which built upon earlier work, this one will require more analysis and design, beginning with a set of use cases. To obtain data, I performed a word frequency analysis of the 324,000 glosses (definitions) from the Texai lexicon. The results are in the project repository at SourceForge. If the system is taught lexical stem and morphological constructions that cover only the most frequently used 15% of the words, it should be possible to automatically parse 50% of the word sense definitions. That is, about 13,000 of the 85,000 words can be used to parse approximately 150,000 of the 300,000 word senses contained in the lexicon. About 12,000 WordNet word senses are already mapped to the OpenCyc portion of the Texai ontology, and that’s a good start.
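The figures above work out to 13,000 / 85,000 ≈ 15% of the words and 150,000 / 300,000 = 50% of the word senses. The method behind the analysis is not described here, so the following is only one plausible way to estimate such a coverage figure: count word frequencies over all glosses, keep the most frequent fraction of the vocabulary, and count the glosses made up entirely of kept words. The function name and the whitespace tokenization are assumptions.

```python
from collections import Counter

def gloss_coverage(glosses: list[str], vocabulary_fraction: float) -> float:
    """Fraction of glosses whose words all fall in the most frequent part
    of the vocabulary. Tokenization is naive whitespace splitting."""
    tokenized = [gloss.lower().split() for gloss in glosses]
    counts = Counter(word for words in tokenized for word in words)
    keep = max(1, round(len(counts) * vocabulary_fraction))
    known = {word for word, _ in counts.most_common(keep)}
    parsable = sum(1 for words in tokenized
                   if all(word in known for word in words))
    return parsable / len(tokenized)

# Toy data for illustration only, not taken from the Texai lexicon.
sample = ["a book", "a table", "a book on a table", "an ancient tome"]
print(gloss_coverage(sample, 0.43))
```

A real analysis would also need stemming and morphological handling, which this sketch leaves out.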
Jim Smart on 25 Apr 2008 at 11:35 pm #
i think that perhaps the rule for favouring an utterance with ‘the least effort when performing a trial parse of the (partial) utterance’, as currently defined, is throwing an odd bias into the mix.
when a human constructs a sentence it is usually the case that the most common word for an object is often preferred in speech.
the visible side-effect of the rule as currently implemented is that the system favours the usage of unique words over common words - which without further weighting will produce sentences with a flowery vocabulary that may appear somewhat quirky.
for me it is somewhat easier to read ‘the book on the table’ than ‘the tome on the table’, the first example scans more easily. of course i fully understand what each phrase means if i read it. and when speaking i would certainly call a book a book - even if there was a book of matches on the table also.
do you think you will have to change the rule from (easier for texai to parse) to (easier for a human to parse)? or will other factors come into play to take care of this, such as having improved weighting in the rule ‘use words that the recipient is otherwise likely to know’, or perhaps by having a better understanding/model of the current context (i.e. knowing there are no other kinds of books in our reference)?
i am interested in your thoughts on this.
i think this whole project is truly fascinating!
best wishes,
/Jim
Jim Smart on 25 Apr 2008 at 11:44 pm #
apols. - i missed favouring ‘reuse previously uttered words’ from my monster question when discussing which other factors would come into play to result in the (expected? - well, for me at least) favouring of the word ‘book’ over tome or volume.
/J
Steve Reed on 01 May 2008 at 9:04 pm #
Hi Jim, currently the features that characterize alternative generation phrases are all weighted the same: 1.0. I expect that reinforcement learning, given a reward signal from the user, will converge on producing the results that you describe. It may turn out that the trial parse feature is not useful, in which case its learned weight will be near zero.
-Steve
AlexBt on 09 Jun 2008 at 3:59 pm #
Hi Steve,
First of all let me say that you are doing great work. I’ve downloaded the Texai lexicon and uploaded it into my Sesame 2 repository. Via RDFEntityManager I managed to pull information from it very easily. It’s truly great stuff!!!
But I have a question regarding lexicon creation.
What I need to do is to create a Russian language lexicon similar to what you’ve done. Can you describe the lexicon creation procedure that you utilized while transforming your MySQL WordNet database into the lexicon? How did you map WordNet word senses to the OpenCyc portion of the Texai ontology?
Regards,
Alex