ZettelKasten Part 2

Building a system for knowledge management

This is a follow-up to a previous article, ZettelKasten Part 1.

In the previous article, we talked about the concepts behind zettelkasten. In this article, we'll continue rapidly building a zettelkasten by choosing a data store.

Warning: this one gets a bit wandering and ranty. If you're only here for the answer, it was:

docker pull janusgraph/janusgraph
docker run --name janusgraph-default janusgraph/janusgraph:latest

That's it. The rest is a journey into terrible documentation, incompatibilities, broken websites, and the fact that the most robust systems are always the least flexible (That's So Java).

You literally don't have to read any of it. Just say "enterprise" to yourself 50 times.

The nature of zettelkasten relationships

Zettelkasten is defined by the relationships that pieces of information share. It's all about finding and surfacing links between data. If you go to the premier site discussing zettelkasten, zettelkasten.de, you'll find all sorts of diagrams showing how zettels relate to each other. Largely, they indicate tree structures, but then the writer gets smart and says "oh, but two nodes on separate branches can also connect!" Congrats genius, you invented a graph.
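To make the tree-versus-graph point concrete, here's a minimal sketch in TypeScript. The note names and the `isTree` helper are made up for illustration. The only real fact it leans on is that a connected undirected graph with n nodes is a tree exactly when it has n - 1 edges, so one cross-branch link is enough to stop it being a tree.

```typescript
// Hypothetical note ids; an Edge is an undirected link between two notes.
type Edge = [string, string];

// Returns true when the notes and links form a tree:
// exactly n - 1 edges, and every note reachable from the first one.
function isTree(nodes: string[], edges: Edge[]): boolean {
  const start = nodes[0];
  if (start === undefined) return false;
  if (edges.length !== nodes.length - 1) return false;

  // Build an adjacency list for the undirected links.
  const neighbors = new Map<string, string[]>();
  for (const n of nodes) neighbors.set(n, []);
  for (const [a, b] of edges) {
    neighbors.get(a)?.push(b);
    neighbors.get(b)?.push(a);
  }

  // Breadth-first walk to check that everything is connected.
  const seen = new Set<string>([start]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of neighbors.get(current) ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return seen.size === nodes.length;
}

const notes = ["root", "cooking", "chemistry", "bread", "yeast"];

// Two branches hanging off a root: a plain tree.
const treeLinks: Edge[] = [
  ["root", "cooking"],
  ["root", "chemistry"],
  ["cooking", "bread"],
  ["chemistry", "yeast"],
];

// "Two nodes on separate branches can also connect!"
const withCrossLink: Edge[] = [...treeLinks, ["bread", "yeast"]];

const treeBefore = isTree(notes, treeLinks);
const treeAfter = isTree(notes, withCrossLink);
console.log(treeBefore, treeAfter);
```

The moment `bread` links to `yeast`, there's a cycle, and tree-shaped storage stops describing the data.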

Previously, we defined a zettel. It has a UUID, and it optionally references some data that we want to archive. It also has content. And it has tags - the tags are the magic. A zettel can have multiple tags, and each tag results in an index of notes about the tag topic, enabling discovery. A lot of Zettelkasten software seems to operate in a relational (SQL) manner, but we're going to go hard into the graph aspect because there's a lot to be gained from performing clustering analysis to surface data.
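Here's roughly what that looks like as a data structure. This is a sketch under my own assumptions: the field names, the `buildTagIndex` and `relatedByTag` helpers, and the short fake ids are all hypothetical, and nothing here touches a database yet.

```typescript
// Hypothetical shape for the note described above. Field names are
// illustrative, not taken from any particular Zettelkasten tool.
interface Zettel {
  id: string; // a UUID in practice; short fake ids are used below
  reference?: string; // optional pointer to archived source material
  content: string;
  tags: string[];
}

// Build the per-tag index: tag -> ids of every note carrying that tag.
function buildTagIndex(zettels: Zettel[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const z of zettels) {
    for (const tag of z.tags) {
      const ids = index.get(tag) ?? new Set<string>();
      ids.add(z.id);
      index.set(tag, ids);
    }
  }
  return index;
}

// The cheapest form of discovery: other notes sharing at least one tag.
function relatedByTag(zettel: Zettel, index: Map<string, Set<string>>): string[] {
  const related = new Set<string>();
  for (const tag of zettel.tags) {
    const ids = index.get(tag) ?? new Set<string>();
    ids.forEach((id) => {
      if (id !== zettel.id) related.add(id);
    });
  }
  return Array.from(related).sort();
}

const zettels: Zettel[] = [
  { id: "a1", content: "Gremlin is a traversal language.", tags: ["graphs", "databases"] },
  { id: "b2", content: "Trees are graphs without cycles.", tags: ["graphs"] },
  { id: "c3", reference: "archive/bread.pdf", content: "Bread needs yeast.", tags: ["cooking"] },
];

const tagIndex = buildTagIndex(zettels);
const relatedToFirst = relatedByTag(zettels[0] as Zettel, tagIndex);
const relatedToThird = relatedByTag(zettels[2] as Zettel, tagIndex);
console.log(relatedToFirst, relatedToThird);
```

In a graph store, each tag would more naturally be its own vertex with an edge to every zettel that carries it, which is what makes the clustering ideas possible later. The in-memory index is just the smallest thing that shows the shape.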

But the content of a zettel isn't well stored in a graph generally. And of course the archived reference data isn't either. Or is it?

There exists a class of databases called "multi-model" databases, and one of them, OrientDB, caught my eye. It's able to store documents as vertices and provide traversals. This is basically the best of all worlds.

Setting up OrientDB

OrientDB has a handy Docker image. This is great, because I can run it on my NAS, which is my end goal. You could just as easily run it on any of the cloud providers, and I might decide to do that too.

Install Docker
docker run -d --name orientdb -p 2424:2424 -p 2480:2480 -e ORIENTDB_ROOT_PASSWORD=somepassword orientdb:latest
Wherever you've done that, go to port 2480

Bam, you've got the studio up.

Well that's easy enough. Let's make an API

Alright. Let's set up koa. And koa-router. You'll find plenty of tutorials online, so I won't bother repeating them here.

Now, you may think we're going to be using OrientDB's weird shitty SQL (no offense; as advanced as OrientDB's capabilities are, they've really managed to make the additional features of their SQL-like language work well) but actually OrientDB is TinkerPop-compliant. So we're using Gremlin!

Gremlin is a graph-traversal syntax that's just plain fun to work with. Much better than SPARQL. Yuck, SPARQL.

Oh wait, set up OrientDB

OrientDB by default doesn't provide Gremlin, despite it being a selling point of its multi-model nature. You would think this would be mentioned in the docs under "APIs and Drivers -> Gremlin API" but you'd be wrong. It's mentioned under "APIs and Drivers -> Java API -> Apache TinkerPop 3" because Java people are not aware that other languages exist, so naturally their solipsism leads them to believe that's a perfectly reasonable way to arrange documentation.

Anyways, you've got to build the entire thing from source, and then build the Gremlin part from source with Maven. Hope you love compiling stuff with Maven for more than 15 minutes.

Okay, now let's connect with Gremlin

Okay, let's make a Gremlin connector.

But since I want to work with TypeScript/ECMAScript, I've got a bit of a challenge: OrientDB doesn't have a Gremlin connector for JavaScript. They seem to assume you'll do any Gremlin stuff in a JVM language (like I said, Java solipsism). They do accept Gremlin commands over their HTTP interface to the server, though. And TinkerPop has a fantastic JavaScript connector. It just doesn't work with OrientDB, because OrientDB doesn't speak Gremlin bytecode over WebSockets and the connector only speaks WebSockets.

On second thought, let's not use OrientDB.

As of this writing, their site is literally broken. I think it's a dying project. Let's see how far we can get with pure graphs.

Here's the thing: sometimes you try something and it sucks. That's the point of experimentation. We just tried something now and it sucked. There's a bunch of wasted effort and that happens. But we wouldn't have known if we hadn't tried.

I know I like Gremlin, so let's go find something free and easy to run on TinkerPop's page. And we'll figure out the document store part later, when we're storing large documents. Let's see if we even run into a limitation in the first place, before we start trying to optimize.

Oh, another multi-model database: ArangoDB, which I previously rejected because its Wikipedia page said it had a "unified query language AQL (ArangoDB Query Language)" which just means "people can't think beyond SQL" - trust me, for graphs Gremlin is much easier to work with. I can't get enough of it. Luckily ArangoDB has a TinkerPop provider. "Provider" is such a nonsense software word by the way. Like "Helper" it just means "I stashed some stuff over here so it's not over there."

Man, ArangoDB really hits you with the upsell.

Let's not use Arango

Never mind. Remember what I was saying about "Provider" being a nonsense software word? Well, in this case, again, it was. It's not something the ArangoDB server does; it's a Java library that lets JVM code talk to Arango with Gremlin syntax. Not what we need.

Back to the drawing board. What the heck is out there that:

I can stand up in Docker in the next 30 minutes
I can query in standard Gremlin ways
Persists, and isn't memory-only

Those are my requirements. I can work out other stuff later.

I found GRAKN.AI and it looked pretty cool, but that's another matter; it's not quite what we're looking for.

Okay, screw it: JanusGraph. Whatever, I know it's full of corporate-speak "enterprise" crap like BerkeleyDB (Oracle, gross), but it'll be the easiest to get up and running.

docker run --name janusgraph-default janusgraph/janusgraph:latest
docker run --rm --link janusgraph-default:janusgraph -e GREMLIN_REMOTE_HOSTS=janusgraph \
    -it janusgraph/janusgraph:latest ./bin/gremlin.sh

A friggin' console. There you go.

Alright, next up, we'll definitely get going on that API.

The adventure continues in
ZettelKasten Part 3