I don't think this is necessarily a good idea. I especially don't think it will be easier than using DCGs. You still need to write some sort of proper grammar that knows more about the input's structure than just "combine certain sequences of words". That said, there is some work on parsing with CHR, see CHR Grammars for example.
Your first problem is that CHR constraints are unordered. When you try to match w(L), w(on), w(R)
, you have no guarantee that L
is actually to the left of R
in the input. This information is not recorded for you.
So what you can do is to record it yourself. You can replace each w(X)
with a w(X, Start, End)
constraint where Start
and End
are some sort of input position markers. For example, simply numbering tokens from left to right:
:- use_module(library(chr)).
:- chr_constraint w/3.
parse(Tokens) :-
parse(Tokens, 0).
parse([], _Position).
parse([Token | Tokens], Position) :-
NextPosition is Position + 1,
w(Token, Position, NextPosition),
parse(Tokens, NextPosition).
You can use this as follows:
?- parse([the, box, is, on, the, table]).
w(table, 5, 6),
w(the, 4, 5),
w(on, 3, 4),
w(is, 2, 3),
w(box, 1, 2),
w(the, 0, 1).
Given this format, you can enhance your matching rules to only combine adjacent inputs (where each component's end marker is the same as the "next" component's start marker):
w(the, S, M), w(X, M, E) <=> w([the,X], S, E).
w(L, S, M1), w(on, M1, M2), w(R, M2, E) <=> w(on(L,R), S, E).
w(L, S, M1), w(is, M1, M2), w(R, M2, E) <=> w(is_(L,R), S, E).
(I use is_
instead of is
because is
is a built-in operator, and is(X, Y)
facts would be printed as X is Y
, which is confusing in what follows.)
This allows you to make some progress:
?- parse([the, box, is, on, the, table]).
w([the, table], 4, 6),
w(is_([the, box], on), 0, 4).
We see that the
and table
at positions 4 to 5 and 5 to 6 were merged together into a phrase the, table
at position 4 to 6. Also the words the
, box
, is
, on
, were merged together into one object covering positions 0 to 4. But this is not correct: "the box is on" is not a valid phrase. The phrase to the right of "is" must be a location or other situation. You need more information about the kinds of sub-phrases in a phrase. You need proper grammar information, the kind that you would encode in a DCG by defining rules with different names.
Here is an expanded version:
:- chr_constraint noun_phrase/3, situation/3, sentence/3.
w(the, S, M), w(X, M, E) <=>
noun_phrase([the,X], S, E).
w(on, S, M), noun_phrase(R, M, E) <=>
situation(on(R), S, E).
noun_phrase(L, S, M1), w(is, M1, M2), situation(R, M2, E) <=>
sentence(is_(L,R), S, E).
And this works:
?- parse([the, box, is, on, the, table]).
sentence(is_([the, box], on([the, table])), 0, 6).
But at this point you are writing a DCG in ugly syntax and with lots of extra information you must track by hand. Why not write the DCG instead?
noun_phrase([the, X]) -->
[the],
[X].
situation(on(Place)) -->
[on],
noun_phrase(Place).
sentence(is_(Thing, Where)) -->
noun_phrase(Thing),
[is],
situation(Where).
This works just as nicely:
?- phrase(sentence(Sentence), [the, box, is, on, the, table]).
Sentence = is_([the, box], on([the, table])).
If you want to get this information into a CHR constraint, you can do that by calling that constraint. For example, with the DCG above:
:- chr_constraint sentence/1.
parse(Tokens) :-
phrase(sentence(Sentence), Tokens),
sentence(Sentence).
This gets the same CHR result as above, but without the position markers which are no longer needed:
?- parse([the, box, is, on, the, table]).
sentence(is_([the, box], on([the, table]))).