Understanding Transformers via N-Gram Statistics

Last Friday I released a preprint of my paper, Understanding Transformers via N-Gram Statistics, which offers insight into how LLM behavior can be described in terms of simple statistical rules. I wrote a detailed X thread summarizing the paper, so I have nothing further to add for now. I'm …