Open sourced my work on LLMs and n-gram statistics

Just open-sourced the datasets used in my NeurIPS 2024 paper, "Understanding Transformers via N-Gram Statistics". The release includes the training data and the associated n-gram data, so the research community can replicate and build on my work measuring the extent to which LLM predictions can be described in terms of n-gram statistics.
