ProseMirror Collab Performance

I ended up working as a Staff Engineer at a startup that, like so many, was banking its core functionality, “Collaborative Rich Text Editing”, on ProseMirror. We dog-fooded the editor extensively; this was the most documentation-focused company I could have ever imagined. We used collaborative editing for planning meetings, brainstorming sessions, feature design docs, retros, and company town halls.

As the company grew fast on the back of a successful VC round (and on the back of ProseMirror), a use-case of 20+ active editors quickly emerged. During these sessions many users complained about choppy editor updates, changes that wouldn’t be seen for minutes at a time, and eventually “freezes” caused by a document lock-out we had implemented for whenever unconfirmed steps passed some threshold.

These are the results of investigating those issues. An explanation, and a solution.

Take my edits!

The ProseMirror project provides a collaborative editing plugin via the prosemirror-collab package. Most people are now familiar with this or the hot new-new: Yjs bindings.

ProseMirror-Collab collects unconfirmed steps and sends them to an authority so they can be recorded to the document’s source-of-truth. After the steps have been stored they are broadcast to all collaborators, at which point each collaborator must:

Undo all of its own unconfirmed steps
Apply the confirmed steps
Re-apply the unconfirmed steps, mapping them through the confirmed changes (and the inverse of themselves, but that’s getting way into the weeds)
Submit their newly mapped unconfirmed steps back to the authority
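
To make the order of operations concrete, here is a minimal sketch in a toy model, not the prosemirror-collab implementation. It assumes the document is a plain string and every step inserts text at a position; InsertStep, applyStep, undoStep, mapOver, mapOverUndo, and rebase are hypothetical names invented for this illustration. It skips ProseMirror's mirror bookkeeping, so it only holds for plain typing, where each local step lands after the text inserted by earlier local steps.

```typescript
// Toy model (an assumption for illustration): the document is a string and
// every step inserts text at a position. Real ProseMirror steps are richer.
interface InsertStep { pos: number; text: string }

const applyStep = (doc: string, s: InsertStep): string =>
  doc.slice(0, s.pos) + s.text + doc.slice(s.pos);

// Undo an insertion by cutting the inserted text back out.
const undoStep = (doc: string, s: InsertStep): string =>
  doc.slice(0, s.pos) + doc.slice(s.pos + s.text.length);

// Move `s` past an insertion that now happens before it (the other step wins ties).
const mapOver = (s: InsertStep, over: InsertStep): InsertStep =>
  over.pos <= s.pos ? { ...s, pos: s.pos + over.text.length } : s;

// Express `s` as if the earlier local step `undone` had never been applied.
const mapOverUndo = (s: InsertStep, undone: InsertStep): InsertStep =>
  s.pos >= undone.pos + undone.text.length
    ? { ...s, pos: s.pos - undone.text.length }
    : { ...s, pos: Math.min(s.pos, undone.pos) };

function rebase(doc: string, unconfirmed: InsertStep[], confirmed: InsertStep[]) {
  // 1. Undo the local unconfirmed steps, newest first.
  for (const s of unconfirmed.slice().reverse()) doc = undoStep(doc, s);
  // 2. Apply the confirmed steps broadcast by the authority.
  for (const c of confirmed) doc = applyStep(doc, c);
  // 3. Re-apply each local step, mapped through the confirmed changes.
  const remapped: InsertStep[] = [];
  unconfirmed.forEach((step, i) => {
    let s = step;
    // Through the inverse of the earlier local steps, newest first...
    for (const u of unconfirmed.slice(0, i).reverse()) s = mapOverUndo(s, u);
    // ...then through the confirmed steps...
    for (const c of confirmed) s = mapOver(s, c);
    // ...then through the earlier local steps as they were re-applied.
    for (const r of remapped) s = mapOver(s, r);
    doc = applyStep(doc, s);
    remapped.push(s);
  });
  // 4. `remapped` is what would be submitted to the authority next.
  return { doc, unconfirmed: remapped };
}
```

For example, if the local document is "hello world!!" (two unconfirmed "!" insertions at positions 11 and 12) and the authority confirms an insertion of ">> " at position 0, rebase returns the document ">> hello world!!" with the unconfirmed steps moved to positions 14 and 15.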

Okay, easy enough! Where is the bottleneck then? The servers are relatively under-utilized; CPU, RAM, and network usage are not particularly high during these collaborations. Profiling the app in Chrome shows plenty of idle time and nothing particularly telling.

At this point it was apparent that a step back needed to be taken, and a greater understanding and intuition of the prosemirror-collab algorithm was needed. So, I’m going to jump into an analogy that sounds ridiculously on the nose but may be super helpful through its slight reframing.

Alice, Bob, and Hal

Imagine that you and 5 collaborators are, way back before the internet, working together to compile a newsletter. One individual, Hal, has agreed to act as the authority. All changes will be mailed to Hal, and Hal will integrate them into the single source-of-truth for the newsletter. However, you must also send Hal the revision of the document you last saw. If this doesn’t match the latest revision Hal has, your changes are sent back rejected, along with the latest revision, so you can try again.

Hal will accept one set of changes per document revision and reject everyone else’s with a note to update their request with the latest revision and try again. One collaborator “wins”, and the rest of the changes, whether they are still in the mail or wherever, are no longer valid.

An odd thing starts to happen. You’ve been trying for weeks to get your changes applied, but Hal keeps rejecting them. You notice that all the recent changes have been from Bob and Alice. What gives?

Unknown to you, Alice, Bob, and Hal all live in NYC. You are based out of San Francisco. Alice and Bob have been receiving updates from Hal the next day while you’ve been waiting 2 days! By the time you receive the latest document revision from Hal, Alice and Bob have already mailed their changes.

And this is the exact issue people hit early on with prosemirror-collab performance: some producers get starved out by constant authority rejection while others get their changes in consistently.
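
Hal's rule boils down to a version check on the authority. The following is a minimal sketch of that check, assuming steps are opaque strings; the Authority class and its method names are illustrative and not part of the prosemirror-collab API.

```typescript
type Step = string; // stand-in for a serialized ProseMirror step (assumption)

class Authority {
  version = 0; // number of steps accepted so far
  private steps: Step[] = [];

  // Accept the steps only if the sender had seen the latest version.
  receiveSteps(basedOn: number, steps: Step[]): boolean {
    if (basedOn !== this.version) return false; // stale: the client must rebase and resend
    this.steps.push(...steps);
    this.version = this.steps.length;
    return true;
  }

  // What a rejected client fetches before trying again.
  stepsSince(version: number): Step[] {
    return this.steps.slice(version);
  }
}
```

If Alice and you both submit against version 0 and Alice's request arrives first, hers is accepted and yours returns false. You then have to fetch stepsSince(0), rebase, and resubmit against version 1, hoping nobody else got in first.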

Starving Artists

In a sense this is an optimistic concurrency control problem, with the added pain of retries needing to be submitted all the way from the end user instead of the server. This greatly expands the time window in which a client’s control value may be invalid without its knowledge. Optimistic concurrency controls often fall short on highly contentious resources due to the retry latency, and so it is for ProseMirror documents.

You could view the time it takes for the first client to receive the latest revision from the authority, and for the authority to receive that client’s updates back, as a discrete window. The maximum number of updates per second will be equal to the number of minimal windows that fit in each second; a window of 100ms means at most 10 updates per second will be processed!
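
The arithmetic is simple enough to write down. This is only a sketch of the ceiling described above; maxUpdatesPerSecond is a hypothetical helper name, and the window is whatever serialized interval limits acceptance.

```typescript
// Upper bound on accepted updates per second, given the shortest
// serialized window (in milliseconds) between two accepted updates.
function maxUpdatesPerSecond(windowMs: number): number {
  if (!(windowMs > 0)) throw new RangeError("windowMs must be positive");
  return Math.floor(1000 / windowMs);
}

console.log(maxUpdatesPerSecond(100)); // a 100ms round-trip window allows 10
```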

During sessions with a high number of active users and edits, those with lower latency will have their edits accepted most often. Groups in higher latency bands will have their updates rejected very often while clients with the highest latencies may not see their updates accepted until long after activity subsides and the client backlogs clear.
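
The starvation effect can be reproduced with a small deterministic simulation. This is a sketch under strong assumptions that I am making up for illustration: every client always has edits waiting, latencies are fixed with no jitter, the authority does no work, and a client resubmits as soon as it sees a new version. SimClient, SimEvent, and simulate are hypothetical names.

```typescript
interface SimClient { name: string; oneWayMs: number; accepted: number; rejected: number }
type SimEvent =
  | { at: number; kind: "submit"; client: SimClient; basedOn: number }  // arrives at the authority
  | { at: number; kind: "update"; client: SimClient; version: number }; // arrives at a client

function simulate(latencies: { name: string; oneWayMs: number }[], durationMs: number): SimClient[] {
  const clients: SimClient[] = latencies.map((l) => ({ ...l, accepted: 0, rejected: 0 }));
  let version = 0;
  // Everyone starts at version 0 and submits immediately.
  const queue: SimEvent[] = clients.map(
    (c): SimEvent => ({ at: c.oneWayMs, kind: "submit", client: c, basedOn: 0 }),
  );
  while (queue.length > 0) {
    queue.sort((a, b) => a.at - b.at); // process events in time order
    const ev = queue.shift()!;
    if (ev.at > durationMs) break;
    if (ev.kind === "update") {
      // The client saw a new version, rebased, and resubmits right away.
      queue.push({ at: ev.at + ev.client.oneWayMs, kind: "submit", client: ev.client, basedOn: ev.version });
    } else if (ev.basedOn === version) {
      // First submission based on the current version wins and is broadcast.
      version++;
      ev.client.accepted++;
      for (const c of clients) queue.push({ at: ev.at + c.oneWayMs, kind: "update", client: c, version });
    } else {
      ev.client.rejected++; // stale by the time it arrived
    }
  }
  return clients;
}

const result = simulate(
  [{ name: "NYC", oneWayMs: 10 }, { name: "Chicago", oneWayMs: 25 }, { name: "SF", oneWayMs: 60 }],
  1000,
);
for (const c of result) console.log(c.name, "accepted:", c.accepted, "rejected:", c.rejected);
```

In this model the 10ms client wins every single round and the other two never get a submission accepted. Real sessions are less extreme because edits are bursty and latencies jitter, but the bias toward low-latency clients is the same.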

There are a number of tricks that might be applied to the collab protocol to alleviate the symptoms. You could use a token ring so clients submit changes in centrally determined order. Client rejections and approvals could be tracked to boost rejected clients over recently accepted clients. This starts to add a lot of extra complexity.

Addressing optimistic concurrency control performance issues usually involves some or all of the following:

Switching to pessimistic locking to eliminate retries
Moving the work waiting on the lock as close to the database as possible (preferably in the database)
Avoiding locking altogether

Can we eliminate the need for clients to retry? Can we move the work closer to the database?

Yes We Can: Commit-based Collab

In 2020 user benaubin wrote on discuss.ProseMirror about creating a new commit-based collab backend. The technique was based on a referenced Apache Wave paper. This led to the release of prosemirror-collab-plus, which has not been updated since August 2020.

The gist is that we can batch updates (steps) into commits, apply those as the atomic changes on the back-end, and map/apply the commits on the back-end instead of rejecting them and sending them back to the clients.

With this, client round-trip latency no longer factors into accepting or rejecting updates; client commits can be applied in the order they arrive based on any committed document version.

Here is a high-level breakdown of the algorithm in action:

Client creates a new commit with:
1. A random, unique ref identifying this commit
2. The latest version received from the authority
3. A slice from the front of the local, unconfirmed steps
The client submits this commit to the authority
1. Only one in-flight commit can exist at once
2. The client must get confirmation the authority processed the commit before a new commit can be created
The authority processes the commit
1. It maps the commit through all document commits since the version listed in the commit being processed
2. It ensures only a single commit exists for each unique ref
3. Broadcasts a commit with the next document version, the mapped steps, and the same ref
The client receives commits from the authority
1. If the next version and ref match the expected next version and the in-flight commit ref, the in-flight commit is cleared as it was accepted as the next commit version
2. Otherwise, it rebases the local, unconfirmed steps on top of the received commit
3. It retries sending the in-flight commit as appropriate
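
The authority's side of this breakdown can be sketched in the same toy model of insert-only steps on a string. This is an illustration under those assumptions, not the prosemirror-collab-commit implementation; InsertStep, Commit, rebaseOnto, and CommitAuthority are hypothetical names. Note that each already-committed step is also carried across the incoming steps, so that a later step in the same commit is mapped against the right positions.

```typescript
interface InsertStep { pos: number; text: string }
interface Commit { ref: string; version: number; steps: InsertStep[] }

const shift = (s: InsertStep, by: InsertStep): InsertStep =>
  ({ ...s, pos: s.pos + by.text.length });

// Map the incoming steps over the committed steps the client had not seen.
function rebaseOnto(steps: InsertStep[], unseen: InsertStep[]): InsertStep[] {
  let over = unseen;
  return steps.map((step) => {
    const overAfterStep: InsertStep[] = [];
    for (const e of over) {
      const committedFirst = e.pos <= step.pos;                // committed text wins ties
      overAfterStep.push(committedFirst ? e : shift(e, step)); // `e` as seen after `step`
      if (committedFirst) step = shift(step, e);               // `step` as seen after `e`
    }
    over = overAfterStep; // the next incoming step is positioned after this one
    return step;
  });
}

class CommitAuthority {
  version = 0;
  private log: Commit[] = []; // log[i] is the commit that produced version i + 1

  receive(commit: Commit): Commit {
    // A retried ref is never applied twice (linear scan for brevity).
    for (const done of this.log) if (done.ref === commit.ref) return done;
    if (commit.version < 0 || commit.version > this.version) {
      throw new RangeError("commit is based on an unknown version");
    }
    const unseen: InsertStep[] = [];
    for (const c of this.log.slice(commit.version)) unseen.push(...c.steps);
    const accepted: Commit = {
      ref: commit.ref,
      version: ++this.version,
      steps: rebaseOnto(commit.steps, unseen),
    };
    this.log.push(accepted);
    return accepted; // a real back-end would broadcast this to every client
  }
}
```

A commit based on a stale version is mapped forward and accepted instead of being bounced back to the client, and resending the same ref returns the already recorded commit rather than applying it again.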

With this system the number of updates per second is determined by how long the document lock is held while determining and applying the next version. In a naive backend implementation this can be as low as 5ms, meaning 200 commits per second can be applied!

ProseMirror-Collab-Commit

We took the ideas from prosemirror-collab-plus and the Apache Wave paper, and built a new plugin based on the structure of prosemirror-collab. This new project has brought over the full test suite from prosemirror-collab, with some tweaks to account for server-side step mapping. A few subtle implementation bugs present in prosemirror-collab-plus have been uncovered and addressed along the way, such as the need for the authority to map steps through their predecessors’ inverse and set mapping mirrors.

We are happy to release this new project under MIT license at stepwisehq/prosemirror-collab-commit and on npmjs.org as @stepwisehq/prosemirror-collab-commit.

Lastly, our previously released project, ProseMirror.Net, includes the server-side mapping functionality for use in .NET back-ends.