Pre-trained transformer language models have become a stepping stone towards artificial general intelligence (AGI), with some researchers reporting that AGI may evolve from our current language model technology. While these models are trained on increasingly large datasets, documentation of basic metrics, including dataset size, dataset token count, and specific details of content, is lacking. Notwithstanding proposed standards for documenting dataset composition and collection, nearly all major research labs have fallen behind in disclosing details of the datasets used in model training. The research synthesized here covers the period from 2018 to early 2022, and represents a comprehensive view of all datasets (including the major components Wikipedia and Common Crawl) of selected language models from GPT-1 to Gopher.
Dr Alan D. Thompson is a world expert in artificial intelligence (AI), specializing in the augmentation of human intelligence, and advancing the evolution of ‘integrated AI’.
Alan provides AI consulting and advisory services to intergovernmental organizations including member states of the European Union, the Commonwealth, and the World Trade Organization. Alan’s applied AI research and visualizations are featured across major international media, including citations in the University of Oxford’s debate on AI Ethics in December 2021. His 2021-2022 experiments with Leta AI and Aurora AI have been viewed over 1.5 million times.
Prior to his work in AI, Alan was a major contributor to human intelligence research and practice. As chairman for Mensa International’s gifted families committee, Alan served two consecutive terms sharing best practice among 54 countries, and his work on gifted education was referenced in the Department of Education’s High Potential policy.
Alan’s best-selling book, Bright, was made available to families at Elon Musk’s gifted school. A copy of the book will be sent to the moon aboard the Peregrine lunar lander in 2022.
Alan continues to advise intergovernmental organizations, enterprise, and international media in the fields of artificial intelligence and human intelligence, consulting to the award-winning series Decoding Genius for GE, Making Child Prodigies for ABC (with the Australian Prime Minister), 60 Minutes for Network Ten/CBS, and Child Genius for Warner Bros.
Alan completed his Bachelor of Science (Computer Science, AI, and Psychology) at Edith Cowan University, 2004; studied Gifted Education at Flinders University, 2017; became a Fellow of the Institute of Coaching affiliated with Harvard Medical School, 2017; and received his doctorate from Emerson, 2021. Alan’s dissertation was adapted into a book featuring Dr Rupert Sheldrake, Connected: Intuition and Resonance in Smart People.