Hello! I am traveling this weekend! I am home in Ontario. I wanted to share this picture I took of this guy feeding a bear.
Don’t feed bears! This is not an endorsement of this behaviour! Quite the opposite!
We love Datalog here. We love it! We're fiends for it. Datalog is very elemental and that makes it a good candidate for having lots of interesting and simple extensions to its baseline.
A Datalog query, at its core, is just a way of expressing a query over a database that produces a new database. It's sort of more powerful than relational algebra because it
1. allows for multiple "outputs," by way of constructing multiple different relations that the user might care about in the result of the query, and
2. supports recursion, in that relations can transitively reference themselves to construct new rows.
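For example, here's the classic transitive-closure program, plus one extra derived relation (the names edge, path, and reachable_from_1 are just made up for illustration). It exercises both points: the query produces two relations a consumer might care about, path and reachable_from_1, and path is defined recursively in terms of itself, which relational algebra can't express.

edge(1, 2).
edge(2, 3).
edge(3, 4).

path(x, y) <- edge(x, y).
path(x, y) <- edge(x, z), path(z, y).

reachable_from_1(y) <- path(1, y).

Here path ends up holding every pair connected by some chain of edges, and reachable_from_1 ends up holding 2, 3, and 4.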
But one of the first things you learn is that vanilla Datalog doesn't support negation. Which feels weird. Because negation is so trivial to add to a traditional imperative or functional language, it seems like something must have gone horribly wrong if it causes "problems." Relational algebra supports negation: I just write NOT and there are no problems at all! So what the heck.
There are two big problems that come up.
The first problem is what logic programmers call "safety." Have you ever noticed that programmers love calling properties "safety"? They love defining safety and then arguing about what has more safety. Anyway. To Datalog people, safety means your query won't accidentally be infinite. This is a classic high-level programming language-style guarantee: we promise you that your program won't have a literally infinite result. It could of course still be real big. So big you can't practically compute it. But not infinite, so don't worry about that.
Anyway. The problem is that we can now write things like
q(1).
q(2).
q(3).
p(x) <- !q(x).
So p(x) is all the things which q is not. So it's definitely not 1. And it's definitely not 2. And it's definitely not 3. And, like, you seem awful sure what it ain't, but what about what it am? I suppose it's 4? And 5? And 6? And the string "foo"? And "potato"?
Maybe we can resolve this by saying, "the only values that are fair game are values that have appeared elsewhere in the program." That seems fair: nothing that the program doesn't "know to exist" can be included. Unfortunately, this is a bad solution because it breaks compositionality. Now, if I add a new module to this program, one that doesn't reference this rule at all, it changes this rule's results. So this is No Good.
The solution that is typically chosen is to require every variable in a rule to occur at least once in a term that is not negated. By doing this, we guarantee that every piece of data actually flowed through the graph to reach us, and we don't have this question of where to draw our possible values from.
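For instance, the unsafe rule from before becomes legal once we guard x with a positive term (the relation d here is just an illustrative "domain" relation I'm making up, not anything standard):

d(1).
d(2).
d(3).
d(4).

q(1).
q(2).
q(3).

p(x) <- d(x), !q(x).

Now p(x) means "a thing in d that is not in q," which is just 4, and nobody has to wonder about "potato."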
The second problem is the introduction of self-reference: with negation, it's now possible to express statements of the form "this statement is false":
p() <- !p().
This is no good because it means we can have valid Datalog programs that don't have a canonical result. It's neither correct for p() to be true nor for it to be false: if we say it's false, the body of the rule holds and derives it, and if we say it's true, nothing actually derives it.
It's not obvious how to prohibit this. The most general solution would be to say "you can't ever have that," meaning, "when executing a program, I can never end up with a term which implies its own negation." There are programs for which this is true, where you'll never run into this problem, and those programs are called locally stratified.
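For example, here's the classic game-playing program (move and win are the usual names from the textbook version of this example): a position is a win if there is a move from it to a position that is not a win.

win(x) <- move(x, y), !win(y).

If the move relation happens to be acyclic, then no fact win(a) ever depends on its own negation, and the program is locally stratified; if move has a cycle, it isn't. Same rules, different data.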
Unfortunately, you can't really tell, by looking at a Datalog program, if it has this property, that it's never going to have that kind of self-referential negation. But we would really like to be able to have negation in our programming language. It's really useful! It's so useful that we're happy to have it even if we only have it in some cases.
You know how in programming languages with correctness checkers, there's this general idea that they're going to be overzealous? Like, for our programming language's semantics, we would like to be able to colour every program red or blue, where they're red if they're going to do something bad, and blue if they're not going to do something bad. And it turns out that in most cases, for most things, it's too hard or too expensive or impossible for that to be a perfect categorization: you can't have a compiler that can correctly colour every bad program red and every good program blue. But! You prefer to err on the side of safety. You say, I'm okay with some good programs being coloured red by my compiler if it means that every bad program is coloured red. That is an okay tradeoff. This is how the borrow-checker in Rust works, for example.
So this is the same kind of trick we pull with Datalog: we define a much stronger property than local stratification, one which is very easy to check and which implies local stratification, at the cost of disallowing some locally stratified programs. The property we use is just called stratification, giving us stratified Datalog. Build the graph of relations where there's a positive edge from A to B for every rule of the form

B(...) <- ..., A(...), ...

and a negative edge from A to B for every rule of the form

B(...) <- ..., !A(...), ...

A program is stratified if there are no cycles in this graph which involve a negative edge.
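Under this rule, the win program from earlier is rejected out of hand, because win has a negative edge to itself, no matter what the data looks like:

win(x) <- move(x, y), !win(y).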
This is sort of using a machine gun to kill a fly: local stratification requires us to have no negative cycles at the row level, and we are lifting this whole baby up to the relation level. This is like using table-level locks to provide serializability. We're sort of throwing everything out the window and throwing up our hands. But it does work!
It's called "stratification" because this rule is equivalent to being able to split all the relations in a program into ordered "strata," in which edges only point from earlier strata to later (or the same) ones, and negative edges only cross stratum boundaries. Plus, this view suggests a way to evaluate stratified Datalog: you evaluate it in stratum order, which means that by the time you have to check whether something is not in a given relation, you've fully computed that relation, so you can answer definitively.
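For example, here's a little program (the relation names are invented for illustration) along with one valid way to slice it into strata: person, parent, and has_child can all live in stratum 0, since no rule defining them uses negation, and childless goes in stratum 1, since it negates has_child.

person(1).
person(2).
person(3).
parent(1, 2).

has_child(x) <- parent(x, y).

childless(x) <- person(x), !has_child(x).

Evaluating stratum 0 first gives has_child = {1}; only then do we run the stratum-1 rule, where !has_child(x) now has a definite answer for every x that person supplies, yielding childless = {2, 3}.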
Once you've done this, you have a functional negation operation, and this is the classic vanilla "Datalog plus negation" that people use in practice. There are other extensions, like Dedalus, which uses a different trick to capture a different useful subset of locally stratified programs.
I don't really have a pithy answer for "why adding negation to logic programming is harder than adding it to traditional programming." I guess traditional programming doesn't really involve you making logical assertions all that often, at least not ones that will cause weird problems if they're inconsistent.