michael-dean-k/

Lazy tokenization

Do hallucinations come from lazy tokenization? Just had an AI tell me that Joan Didion wrote an essay called “On Grief and Grieving.” Does not exist. She did write The Year of Magical Thinking, a memoir that touches on grief. It turns out On Grief and Grieving is actually the title of Elisabeth Kübler-Ross’s book. In trying to solve this, I found a college essay on grief that listed its sources at the end: “The Year of Magical Thinking by [Joan Didion; On Grief and Grieving] by Elisabeth Kübler-Ross; Tuesdays with Morrie by Mitch Albom …” (brackets added for emphasis). Do you see what it did? One of the sins of bulk data ingestion is that AI arbitrarily splits context for tokenization (i.e., every X words), and so in this case, it’s mixing one author with another author’s book, simply because they are adjacent in some student’s college paper source list.
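Here is a toy sketch of what I mean, not a claim about how any real model is trained. It assumes a hypothetical splitter, `split_every_n_words`, that cuts text every six words with no regard for where one citation ends and the next begins. Real pipelines count tokens rather than words, and the window of six is picked only to make the point visible.

```python
def split_every_n_words(text: str, n: int) -> list[str]:
    """Hypothetical naive splitter: cut the text into chunks of n words each."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


# The source list from the student essay, as quoted above.
sources = (
    "The Year of Magical Thinking by Joan Didion; "
    "On Grief and Grieving by Elisabeth Kübler-Ross; "
    "Tuesdays with Morrie by Mitch Albom"
)

# Assumed window size of 6 words, chosen purely for illustration.
chunks = split_every_n_words(sources, 6)

for chunk in chunks:
    print(repr(chunk))
```

With that window, the second chunk is `Joan Didion; On Grief and Grieving`: Didion's name sits right next to a title she never wrote, and her own book has been cut off into the chunk before it.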