
AI’s Robin Hood
Aaron Gokaslan ’18, ’19 ScM, gives the secrets of AI to the people

By Jack Brook ’19 / April–May 2025
April 10th, 2025
Gokaslan in Dublin last summer, where he was one of 25 people Mozilla named as leading the next wave of the Internet. Photo: Mozilla

When the world’s leading artificial intelligence research group, OpenAI, released GPT-2, an early forerunner of ChatGPT, in 2019, the organization warned it was too dangerous to share the inner workings of the model with the public.
Aaron Gokaslan ’18, ’19 ScM disagreed. 

To prove a point, he and classmate Vanya Cohen ’18, ’19 ScM, reverse-engineered the cutting-edge large language model—capable of writing Shakespearian plays or presidential speeches based on prompts from users—developed by the brightest minds in Silicon Valley. Together, Gokaslan and Cohen drew on millions of web pages sourced from Reddit to largely replicate the mysterious datasets that made ChatGPT possible. They publicly shared their own underlying datasets, allowing anyone to develop and fine-tune a chatbot.

“A lot of the work wasn’t appreciated at the time, particularly, four or five years ago, when I started working on it, and people just kind of saw it as a fun novelty,” Gokaslan says. Now it has become a foundational part of AI development, placing language modeling tools in the hands of researchers all over the world. The duo’s publicly available dataset has over 4 million downloads.

While most tech giants are advancing their artificial intelligence research behind closed doors, Gokaslan believes in public access to the tools shaping the future of AI. In 2024, the open source internet nonprofit Mozilla recognized Gokaslan as one of the world’s leading “architects of trustworthy AI” for his work creating and maintaining open source datasets and large language models available for anyone to use.

“I want people to be able to build their own small version of these models, or be able to take an existing model and fine tune it to their needs,” says Gokaslan, currently a doctoral student studying AI at Cornell. “How can you help not just accelerate your own research, but everyone’s research?”

Gokaslan also played a leading role in BLOOM, an AI language model similar to ChatGPT, built by hundreds of international collaborators and trained on a French government supercomputer. Designed to make AI research more accessible, BLOOM speaks more than 46 human languages, from Nepali to Swahili.

Wary of AI developers infringing on artists’ copyrights, Gokaslan prepared another dataset called Common Canvas. It compiles 100 million Creative Commons photos so that anyone can craft their own AI-assisted images without stealing from licensed materials.

Gokaslan is mindful of the potential misuse of these tools and advises organizations like Encode Justice, which advocates for ethical AI policies. He also helped pioneer a widely employed license for responsible AI use that can be customized for different products. The basic template includes restrictions on wielding an AI model to do things like discriminate or give medical advice.  

He’s accomplished all this while finding time to help maintain a range of popular code libraries like pybind11 and PyTorch. Open source work may often be thankless, but Gokaslan remains inspired by its far-reaching impact.

“Seeing people actually use the technology I built, every day, is extremely gratifying,” he says. 
