standing next to me

white feather hawk tail deer hunter

so my current project involves simulating the texting experience with LLMs

which sounds pretty basic, right? haven’t people been doing that since forever

yes, but all their innovations have been in relation to personality, e.g. memory, anti-sycophancy guardrails, typing style, etc. however, there’s one aspect of texting that, to the best of my knowledge, nobody has tackled: asynchrony.

all LLM conversation is fundamentally turn-based: you say something, the LLM says something, you say something, the LLM says something, and so on. obviously, nobody really texts like that, which is, for me at least, rather immersion-breaking.

at a basic level, you can use batching/splitting to give the illusion of multiple messages. for example, you can have the LLM output what are meant to be individual texts line-by-line, split them on the consuming side, and present them to the user as separate messages. similarly, you can listen for user messages, combine them into one long string, and send that to the LLM.
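here’s a minimal sketch of that batching/splitting idea in Python. the function names (`split_reply`, `combine_messages`) and the newline separator are my own placeholders for illustration, not part of any library, and a real setup would need to agree with the prompt on what the delimiter is.

```python
def split_reply(raw: str) -> list[str]:
    """Split one LLM completion into separate texts, one per non-empty line.

    Assumes the model was prompted to put each text on its own line.
    """
    return [line.strip() for line in raw.splitlines() if line.strip()]


def combine_messages(messages: list[str], sep: str = "\n") -> str:
    """Join a burst of user texts into a single prompt string."""
    return sep.join(m.strip() for m in messages if m.strip())
```

so a completion with two lines becomes two chat bubbles, and a burst of user texts becomes one prompt. note this only fakes multiplicity; the exchange underneath is still one turn each.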

however, that’s quite far from full asynchrony; imagine the following human conversation:

A: heyyy
A: saw your Insta story?
A: what's up???

B: ugh I wish
B: like I could just up and disappear for a bit

A: hm okay I get that
A: but
A: like
A: you have lots of commitments, right? what about like work and family and all that

now imagine, in some strange world, A is a human and B is an LLM, which means that the conversation is more like this:

human: heyyy | saw your Insta story? | what's up???

LLM: ugh I wish | like I could just up and disappear for a bit 

human: hm okay I get that | but | like | you have lots of commitments, right? what about like work and family and all that

so earlier we talked about batching, i.e. combining multiple human messages into a single prompt… but how do we decide when to do that? the LLM can’t wait forever (well, it can, but then the conversation will never progress).

let’s say you wait 3 seconds after the last message. in the above case, it’s very possible that the system decides, after the “like” message, that the human is done, and starts generating an LLM response.
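the “wait N seconds after the last message” rule is basically a debounce timer. here’s a rough sketch using `asyncio`; the class name `DebounceBatcher` and the `on_flush` callback are hypothetical names I made up, and the quiet window is just a parameter (3 seconds in the example above).

```python
import asyncio
from typing import Callable, List, Optional


class DebounceBatcher:
    """Collect messages and flush them once the sender has been quiet for a while."""

    def __init__(self, quiet_seconds: float, on_flush: Callable[[List[str]], None]):
        self.quiet_seconds = quiet_seconds
        self.on_flush = on_flush  # hypothetical hook: e.g. kick off an LLM call
        self._pending: List[str] = []
        self._timer: Optional[asyncio.TimerHandle] = None

    def add(self, message: str) -> None:
        """Record a message and restart the quiet-period timer.

        Must be called from inside a running event loop.
        """
        self._pending.append(message)
        if self._timer is not None:
            self._timer.cancel()
        loop = asyncio.get_running_loop()
        self._timer = loop.call_later(self.quiet_seconds, self._flush)

    def _flush(self) -> None:
        # Hand over everything collected so far and start a fresh batch.
        batch, self._pending = self._pending, []
        self._timer = None
        self.on_flush(batch)
```

the failure mode is easy to see: if the human pauses longer than `quiet_seconds` after “like”, the batch flushes without the final message, and that message ends up in a batch of its own.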

so then, what happens when the human finally finishes typing and sends that message? because the underlying structure is still turn-based, we are faced with a few unpalatable options:

  1. restart generating the LLM’s response
  2. ignore the human’s message entirely
  3. immediately trigger another LLM response (to the new content)
  4. defer generating a response by waiting for more human content (that arrives, nominally, after the LLM has “finished responding”)
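to make the four options concrete, here’s a small sketch that maps each one to what the system would do with a late message. everything here (`LatePolicy`, `Plan`, `plan_for_late_message`) is a hypothetical naming of my own, and it only describes the decision, not the actual cancellation or LLM call.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class LatePolicy(Enum):
    RESTART = auto()    # option 1: throw away the in-flight reply and regenerate
    IGNORE = auto()     # option 2: drop the late message
    FOLLOW_UP = auto()  # option 3: finish the reply, then respond to the new content
    DEFER = auto()      # option 4: hold the late message and wait for more


@dataclass
class Plan:
    cancel_in_flight: bool             # abort the reply currently being generated?
    next_prompt: Optional[List[str]]   # messages for the next LLM call, if any
    wait_for_more: bool                # keep listening before making that call?


def plan_for_late_message(policy: LatePolicy, in_flight: List[str], late: str) -> Plan:
    """Decide what to do with a message that arrives after generation has started."""
    if policy is LatePolicy.RESTART:
        return Plan(True, in_flight + [late], False)
    if policy is LatePolicy.IGNORE:
        return Plan(False, None, False)
    if policy is LatePolicy.FOLLOW_UP:
        return Plan(False, [late], False)
    return Plan(False, [late], True)  # DEFER
```

writing it out like this mostly shows where the costs land: `RESTART` pays for a wasted generation, `IGNORE` loses content, and `FOLLOW_UP` and `DEFER` both respond to the late message out of context of the reply that was already sent.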

each of these has issues (cost, latency, engineering complexity, realism, etc.), because:

  1. there is a fundamental impedance mismatch between turn-based and asynchronous conversation
  2. which choice is socially appropriate depends heavily on judgment about the specific conversation, not on a fixed rule

currently I don’t have a good solution, just a bunch of jury-rigged patches, but I feel like I can get there eventually…?