Cognitive Collective

Helping you find your next career in AI. Learn more about the job board on the Scale blog.

Are you a scaling AI startup? Email to be added to our board.

Staff Data Engineer - Research/Machine Learning



Software Engineering, Data Science
Menlo Park, CA, USA
Posted on Friday, March 22, 2024

About us

Character’s mission is to empower everyone with AGI. Our vision is to enable people with our technology so that they can use Character.AI any moment of any day.

Character.AI is one of the world’s leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character.AI is a full-stack AI company with a globally scaled direct-to-consumer platform. As of 2023 that platform was #2 in the space in user engagement. Character.AI is uniquely centered around people, letting users personalize their experience by interacting with AI “Characters.” The company achieved unicorn status in 2023 and was named Google Play’s AI App of the Year.

Noam co-invented the key tech powering LLMs and was recently named to TIME100’s Most Influential People in AI list. TIME called him “one of the most important and impactful people of the space’s past, present, and future.” Daniel created and led LaMDA, the breakthrough conversational tech project currently powering Bard.

To learn more, please visit

About the role

You would be a great fit for this role if you are an experienced engineer who will be instrumental in building the world's best LLMs by collecting and refining the essential training data that powers them. In pursuit of the best language models, your responsibility is twofold:

  • First, identify and collect data at the scale required to feed our largest models. This involves managing a diverse set of sources, including structured and unstructured content from text and multimedia formats. Your engineering expertise is crucial in crafting the infrastructure and tools necessary to efficiently collect and manage petabytes of data.

  • Second, you will experiment with various methods of extracting a balanced and comprehensive training dataset from the raw data. You will leverage your expertise in data to build datasets reflecting a hypothesis, train models, and evaluate experimental results. Through this experimentation, you will create the training datasets for our largest models.

These are critical steps in the construction of AI. With petabytes of data and numerous design decisions, each step requires careful attention. Expertise in AI is not necessary, but enthusiasm for the space and a track record of adapting to new domains is important.

Who we’re looking for

Required Experience:

  • 5+ years of production software engineering experience

  • Experience building large-scale data processing pipelines, with tools like PySpark, Beam, or Flink

  • Familiarity with Machine Learning and NLP and willingness to learn more on the job

  • Track record of adapting to new domains and a desire to use data to improve products

Additional Desired Experience:

  • ML experience as an ML engineer, Data Scientist, or another similar role

  • Experience with cloud platforms like AWS or Azure, or tools such as Kubernetes and Terraform

  • Passionate about Conversational AI or large language models

You will be a good fit if you are proactive and have a “get things done” mindset. Given our current pace of growth and load on our systems, most people have had a significant impact during their first week at the company.

Character is an equal opportunity employer and does not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or any other legally protected status. We value diversity and encourage applicants from a range of backgrounds to apply.