Texai - The Year in Review 2007
As we close 2007, it is a good time to review this year’s project milestones on the path to creating artificial intelligence.
- In January 2007, I released a KB Reference Manual describing the then-current MySQL KB implementation. At that time the preliminary merge of the OpenCyc content with WordNet and the CMU Pronouncing Dictionary was complete, and the KB had grown to over 10 million statements.
- My paper, Semantic Annotation for Persistence, was accepted for publication at the AAAI-07 Workshop for Semantic E-Science in Vancouver, Canada.
- In February 2007, I completed the extraction of structured content from the Wiktionary English Lexicon, which increased the MySQL KB to approximately 15 million statements.
- Inspired by Dr. Jerry Ball’s publications on Double R Theory, I was led to study both Construction Grammar and Walter Kintsch’s Construction / Integration theory of reading comprehension. For the latter I wrote a Java class that duplicates the spreading activation result of Kintsch’s worked out example from his book.
- I wrote a simple dialog use case in which the user says “hello” to the dialog system via a Jabber compatible Smack API. Because there was no existing Java construction grammar implementation, I wrote my own that could handle the use case. I decided to create a Texai Lexicon that would support my grammar, and began merging the lexical entries from OpenCyc, WordNet, The CMU Pronouncing Dictionary, and Wiktionary. Ultimately this effort would take the KB over 20 million statements. I became concerned as the MySQL KB implementation began to slow down during the conversion of the Texai Lexicon.
- From early spring to mid-summer 2007 I rewrote the Texai KB implementation twice in search of an ideal trade-off between (1) expressiveness, (2) speed, and (3) space utilization. In terms of expressiveness, the original Hibernate/MySQL implementation was a faithful implementation of CycL, the language of the Cyc KB, with which I am very familiar given my seven years at Cycorp. Wanting to retain CycL compatibility, I first rewrote the KB for the Oracle Berkeley DB Java Edition database, which sacrifices SQL compatibility for increased speed. Without SQL, I had to write my own mapping layer in lieu of Hibernate, but because Hibernate is open source I could adopt all of Hibernate’s speed and caching optimizations by inspecting its classes. This work resulted in the first version of the RDF Entity Manager, which enabled semantically annotated Java objects to be persisted in the Oracle Berkeley DB. While the speed increase met my expectations, Berkeley DB Java Edition required over three times the disk space of MySQL. About this time I attended the AAAI-07 conference in Vancouver to present my workshop paper and was again impressed with the vitality of the RDF community and the progress it had made in the years since I first exported OpenCyc to RDF at Cycorp. Subsequent research led me to evaluate Sesame 2. This highly regarded RDF store is enhanced in version 2 to support RDF named graphs, which can express single-level context. By September 2007, I had rewritten the RDF Entity Manager to access Sesame 2 and was totally satisfied with the speed and space utilization of Sesame 2 as the Texai KB store. Export to RDF and import from RDF are trivial with Sesame, which enables Texai to easily participate in the Semantic Web.
- In the midst of the KB store rewrite, I contacted Dr. Hans Boas at the University of Texas Linguistics department and subsequently met with him to discuss the Texai project and in particular Construction Grammar. I came away from that meeting with growing confidence that Construction Grammar was ideally suited for deep understanding of English and parsing to logical statements.
- In September, I released RDF exports of the OpenCyc content, WordNet 2.1 content, The CMU Pronouncing Dictionary, the Wiktionary, and the merged Texai Lexicon on SourceForge. I also released the RDF Entity Manager and Texai Utilities packages. As the year closes, there have been over 630 files downloaded and the project ranks in the top half of one percent of all projects hosted at SourceForge.
- Stymied by the problem of how semantics could be connected between child and parent constructions in the parse of an utterance, I turned again to the literature and discovered that this problem has been solved by the Fluid Construction Grammar project. I contacted Joachim De Beule and found that their open source implementation does not have broad coverage of English and is implemented in Lisp. I was encouraged by him to use FCG. By November 2007, I released my Java FCG implementation that correctly parses their use case sentence “Jill slides blocks to Jack”.
- In November I also attended the AAAI Fall Symposia in Arlington, Virginia, USA. The track that I registered for was focused on Cognitive Approaches to Natural Language Processing. This track was organized by Dr. Jerry Ball. I learned not only that NL parsers should operate word-by-word as people do, but that Dr. Ball’s Double R Grammar was well worth study. I decided to marry FCG with Double R Grammar. The former is an excellent rule-based, semantics-focused parsing and generation engine, and the latter is a semantics-focused grammar.
- By the close of 2007 I enhanced the Java Fluid Construction Grammar system to operate incrementally, and to parse one of Dr. Ball’s use case sentences “the block is on the table”. Because FCG is a bidirectional grammar, I could parse logical RDF statements from the input sentence, and then turn around and use the same grammar rules to generate the original sentence from the logical statements. This feature is very important as it avoids the need to develop and maintain a separate NL generation system for dialog.
- In December I learned from my friend Georgia Harper that I should put more energy into this website and blog, and also to seek collaborations. To date, the blog has received over 1800 visitors and collaborations have begun to develop.
To summarize the year: I did not accomplish all that I wanted; the goal of building a bootstrap English dialog system that intelligently acquires knowledge and skills remains. However, the fact that I have seen an English utterance converted fully from text into logic, and back again, gives me great confidence that the Incremental Fluid Construction Grammar engine and the constructions from Double R Grammar can be extended to meet the needs of the Texai bootstrap dialog system.
Jean-Paul on 01 Jan 2008 at 11:53 am #
Well done - good to see someone taking a different approach and not just talking about it but getting down to coding. You’re a motivating example. Keep up the good work in 2008!
PS Why not add also ConceptNet to your KB?
Steve Reed on 01 Jan 2008 at 5:37 pm #
I looked at ConceptNet three years ago when I was at Cycorp. The first issue is that it is not truly open source - if you are a commercial user there is a commercial license involved. The second issue is the quality of the KB. When I looked at it there were a substantial number of joke or sarcastic assertions that would require manual filtering before combining with the much higher quality OpenCyc, WordNet and even Wiktionary content already in the Texai KB. And the final issue is the lack of precision in the assertions, because the KB was gathered from filled-in web forms that were not subject to semantic well-formedness checks.
Once the Texai bootstrap dialog system is operational, it should overcome these objections that I have with ConceptNet.
-Steve
Arthur T. Murray on 04 Jan 2008 at 11:29 pm #
The “Year in Review 2007” is an interesting read. I was struck by how much this project tries to gobble up enormous corpora of lexical and conceptual information. In my own Mind.Forth project, I start out with a minimal innate knowledge base (KB) and let users add simple Subject-Verb-Object (SVO) factoids. Then the AI Mind tries to maintain meandering chains of NLP-generated thought about the contents of the KB, and also in response to user input. Most recently (in 2008) I am trying to get the AI Mind to back out of generating thoughts where insufficient information is known in the KB - specifically, where a subject noun is proposed but no verb is available to complete a statement. Then the AI is supposed to go by default to a question asking the human user for more information. If the human user makes no response, the AI is supposed to continue thinking - either self-referentially because of an Ego module, or in a traversal of the KB (not yet coded :-). In sum, it is interesting to see such an ambitious AI project and the approach being taken - apparently an approach of organizing massive, Cyc-like data, whereas my own approach is to _grow_ the artificial mind from a starting point of minimal knowledge, adding more mental functionality as the mindcore is debugged and permits tack-on functionality. Otherwise - Happy New Year 2008! -Arthur T. Murray
Joe Simone on 11 Jan 2008 at 3:54 am #
Hi Steven,
I just happened to stumble on this web site!
Last I heard you were at Cycorp? I guess you are now striking out on your own. I have been following Doug Lenat on and off for about 15 years - since around 1992. I used to think that Cyc would eventually hit pay dirt. I am not so sure anymore. OpenCyc seems to be a bust as well, as it is so poorly supported and lacks the vibrancy of an active community. I look at the posts on SourceForge and very rarely do questions get answered. Both the OpenCyc and Cyc web sites seem not to be updated frequently, if at all. Overall I see a lack of progress and stagnation - but that’s because I can only see from the outside and don’t know what’s going on inside Cyc land. Hopefully Texai will fare better.
Have you looked at using/incorporating the OpenMind KB? I participate in OpenMind and it’s enjoyable and amusing sometimes to teach new knowledge. I’ve played the Cyc FACTory game and it is pitiful. Oh well.
If you have the time, would you be so kind as to give a brief summary of the current state of affairs in pursuit of a common sense AI? If you do not wish to post here, you can just email me directly. Is Cyc destined for failure? Are we even close to bootstrapping AI such that it can learn from dialog?
Best regards for a great and productive 2008!
Joe
Steve Reed on 11 Jan 2008 at 9:47 pm #
Hi Joe,
After working at Cycorp seven years, I was let go in August 2006 during a massive downsizing due to cuts from a major US government sponsor. I appealed to my wife and our investments for support so that I could pursue my own AI research.
As you perhaps know, John DeOliveira and I were the instigators of OpenCyc. After we left, and also as a result of the reduced funding, OpenCyc has indeed stagnated. In the meantime John D founded the Cyc Foundation, whose mission is to augment OpenCyc via publicly accessible tools. I attend their monthly face-to-face meetings in Austin.
In contrast to Cycorp, which is a for-profit company that must protect its intellectual capital, my project is non-profit and open source from scratch. I plan to create artificial intelligence by writing, with least effort, a bootstrap English dialog system that intelligently acquires knowledge and skills, and then recruiting a multitude of volunteers to continue the work.
Because OpenMind is related to ConceptNet, please see my above comment.
I will write the post you suggest immediately.
-Steve