Just open-sourced the datasets used in my NeurIPS 2024 paper, "Understanding Transformers via N-Gram Statistics." The release includes the training data and the associated n-gram data, so the research community can replicate and build upon my work measuring the extent to which LLM predictions can be described in terms of n-gram statistics.