Sunday, February 8, 2009

Formal Systems, Formal Logic Systems, and Natural Language Processing

Recently I read a post on Usenet about how to formalize a natural language:
http://groups.google.com/group/comp.ai.nat-lang/t/5855301973b928da?hl=en
Reading it, I felt there are two distinct issues:
1. Representing an arbitrary set of natural language statements (of finite length, say 10 pages) in a formal system.
2. Solving them on a computer.
A formal system is not the same as a formal logic system.
While formal systems may use formal logic, they need not do so. To qualify for the attribute "formal", a system has to meet certain requirements. See, for example, http://formalsystemsphilosophy.blogspot.com/ to get a feel for what I mean.
Well, what does it mean to formalize in the first place?
To formalize is to agree on basic terminology, methods of reasoning, and the interrelations of terms in the vocabulary/lexicon; in short, to have a good model for the stated purpose. Please note that there may be different models of the same domain, serving different purposes, or serving the same purpose with different efficiencies. Models for NLP range from N-gram models to PCFG to HPSG. Though HPSG/LFG may perhaps not be counted as models in the strict sense, any form of grammar is in essence a model of the language.
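To make the simplest end of that spectrum concrete, here is a toy bigram (N = 2) model with add-one smoothing. The corpus, the smoothing choice, and the function names are my own illustrative inventions, not anything from the post:

```python
from collections import Counter

# Toy corpus; any tokenized text would do.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count bigrams and unigrams.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(set(corpus))

def bigram_prob(w1, w2):
    """P(w2 | w1) with add-one (Laplace) smoothing."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

def sentence_prob(words):
    """Probability of a word sequence under the bigram model."""
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_prob(w1, w2)
    return p
```

On this tiny corpus, an observed bigram such as ("cat", "sat") comes out more probable than an unobserved one such as ("cat", "rug"), which is the whole trick: the model knows nothing about cats or rugs, only about counts.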

To me, a model of the language should:

  • Account for observed natural language phenomena.
  • Be kind to the context; I mean it must be context sensitive.
  • Account for words, their interrelationships, and their correspondence to the physical world.
  • Allow rephrasing to be carried out in a mechanical manner by a human using the model.
  • Allow linguistic entailment to be worked out using the model.
  • Solve most, if not all, problems through inference using the model and accepted forms of inferencing.

Note that I am saying humans should be able to work with the model. Computers can follow later, if possible.

Is there a good language model?

The answer, sadly, is a definite no in the sense described above. All the models used by the AI/NLP community draw their inspiration from:

  • Statistical processing: n-gram, MaxEnt, CRF
  • Statistics with grammar: PCFG
  • Pure grammar models: HPSG/LFG
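The middle entry, statistics with grammar, can be illustrated by what a PCFG actually computes: each rewrite rule carries a probability, and a parse tree's probability is the product of the probabilities of the rules used in it. The grammar, rule probabilities, and tree encoding below are an invented toy of mine, not drawn from any real treebank:

```python
# Toy PCFG: (lhs, rhs) -> probability. The probabilities of rules
# sharing a left-hand side sum to 1. All numbers are invented.
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("det", "noun")): 0.6,
    ("NP", ("noun",)): 0.4,
    ("VP", ("verb", "NP")): 0.7,
    ("VP", ("verb",)): 0.3,
}

def tree_prob(tree):
    """Probability of a parse tree: the product of its rule probabilities.
    A tree is (label, child, child, ...); a bare string is a pre-terminal."""
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = pcfg[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

# One parse shape for a sentence like "the dog chased a cat":
# S -> NP VP, NP -> det noun, VP -> verb NP.
tree = ("S",
        ("NP", "det", "noun"),
        ("VP", "verb", ("NP", "det", "noun")))
```

A statistical parser then amounts to searching for the tree with the highest such product, which is exactly where the grammar you committed to in advance constrains everything that follows.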

Limitations of all Statistical Methods

Machine learning algorithms do not process arbitrary language inputs; at best they cater to a small subset of the language.
Any machine learning technique has to use statistical processing of one sort or another, and to put it crudely, "statistical processing is tossing a coin to decide whether the human in front of you is male or female". Before you do number crunching using MaxEnt or CRF or SNLP, you need to know what numbers you are crunching. At one point in time statistical parsing was supposed to be the ultimate answer; today people talk in terms of using it in conjunction with HPSG and the like. These remarks apply to all statistical techniques, irrespective of the domain. Einstein is supposed to have said "God doesn't play dice" in the context of quantum mechanics.

The only justification for a statistical technique is that a normal law is subsumed by the technique with a probability of 1. But in general we are trying to induce the law using some sort of statistical processing in which we ourselves define the terms of reference, e.g. the grammar or the attributes and so on. Goof those up and you goof up the whole thing. Domain knowledge is more important than statistics or number crunching in figuring out the supposed law. "Garbage in, garbage out", as I learnt ages ago from a computer text.

In any form of statistical processing we are doing two things. First, we are using inductive logic. Second, we are specifying the parameters in terms of which to find the law, that is, the statistical model's parameters. The parameters represent our belief in a particular formulation of the law. Suppose we did not know about the ideal gas law: we might gather data on pressure, volume, and the colour of the gas and get a good correlation, but, as we all know, we would miss the actual law. In the context of NLP we need to distinguish between a language model and a statistical language model. The former underlies the latter, and no amount of refinement of the latter can compensate for deficiencies in the former. To put it more concretely, a change in grammar will mean a different treebank, or its equivalent. You may improve the maximum likelihood of the data, but the data you collect is based on what you believe are the parameters governing the process.
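The ideal gas example can be made concrete. Choose the right parameter (1/V, per Boyle's law at fixed temperature) and the correlation with pressure is essentially perfect; choose an irrelevant one such as colour and you get whatever the sample happens to give, and no amount of number crunching will surface the law you failed to parameterize. The data below is synthetic and purely illustrative:

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)
# Synthetic measurements at fixed temperature: P = nRT / V, with n = 1.
volumes = [1.0 + 0.5 * i for i in range(20)]
pressures = [8.314 * 300 / v for v in volumes]
colours = [random.random() for _ in volumes]   # irrelevant attribute

# Right parameter: 1/V correlates perfectly with P.
r_inv_volume = pearson([1 / v for v in volumes], pressures)
# Wrong parameter: colour correlates only by sampling accident.
r_colour = pearson(colours, pressures)
```

The point is not that the statistics are wrong; the coefficient is computed correctly in both cases. The point is that the choice of what to measure, which is domain knowledge, decides whether the computation means anything.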

Pure grammar solutions like HPSG/LFG rely on grammar, which is another term for word order, plus FOL semantics.

Is there really a grammar that can describe most, if not all, of the language? Unfortunately we work on the assumption that there is. When you look at the number of rules in the raw treebank (around 15,000), something seems lacking; even after removing redundant or subsumed rules you are still left with a big bunch of rules, and that is only the tip of the proverbial iceberg.

FOL is only a subset of all reasoning, yet it is the basis of all current semantics, take it or leave it. So the sum total of all these methods, HPSG/LFG and others of their ilk included, can at best represent or model a tiny subset of language.
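Part of FOL's appeal is precisely that entailment in its propositional fragment is mechanically checkable (full FOL entailment is only semi-decidable). Here is a brute-force truth-table sketch; the encoding of premises as Python lambdas is a hypothetical convenience of my own:

```python
from itertools import product

def entails(premises, conclusion, atoms):
    """True iff every truth assignment satisfying all premises
    also satisfies the conclusion (brute-force truth table)."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False
    return True

# "All men are mortal" collapses, propositionally, to: man -> mortal.
premises = [
    lambda e: (not e["man"]) or e["mortal"],   # man -> mortal
    lambda e: e["man"],                        # Socrates is a man
]
conclusion = lambda e: e["mortal"]             # Socrates is mortal
```

The checker happily confirms the syllogism, and just as happily refuses it if you drop a premise; what it cannot do is represent the mass of everyday reasoning (defaults, analogy, context) that never fits this mould, which is exactly the limitation being complained about above.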

Whither a New Model?

No answers from me, but I will say one thing: I am an optimist, and I hope some clever philosopher/linguist/NLP whiz will solve the general language understanding problem.

Is Mathematics a Formal Logic System, and Can Mathematics Represent Natural Language?

Basically, formal logic was invented to make proofs foolproof. This means one can conjecture a theorem in mathematics and prove it using a valid reasoning procedure (deductive logic) to the satisfaction of one's peers.

I claim that "Mathematics is not equivalent to a formal logic system". To justify this I need only point out, as Gödel's incompleteness theorem shows, that there are statements in arithmetic which are true but cannot be proved from a given set of axioms.

I believe there might be alternative paradigms to the present FOL plus grammar approach.
