Gemma 2 features the same extremely large vocabulary as release 1.1, which tends to help with multilingual and coding proficiency.
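To get a feel for what that vocabulary looks like in practice, the minimal sketch below loads the Gemma 2 tokenizer through the Hugging Face transformers library and counts tokens for a few sample strings. It assumes transformers is installed and that you have access to the google/gemma-2-9b checkpoint on the Hub; the sample texts are illustrative only.

```python
# Minimal sketch: inspect the Gemma 2 tokenizer and compare token counts.
# Assumes `transformers` is installed and the google/gemma-2-9b checkpoint
# is accessible (it is gated behind a license acceptance on the Hub).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")

# A large vocabulary (roughly 256k entries) tends to split multilingual and
# code snippets into fewer tokens than smaller vocabularies would.
print("vocabulary size:", len(tokenizer))

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "code": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    "multilingual": "机器学习正在改变世界。",
}
for name, text in samples.items():
    tokens = tokenizer.tokenize(text)
    print(f"{name}: {len(tokens)} tokens")
```

Shorter token sequences for the same text generally mean more content fits in the context window and fewer decoding steps are needed per response.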
Gemma 2 9B was trained on a broad dataset of 8 trillion tokens, about 30% more than Gemma 1.1, drawn from similar sources, including:
Web Documents: A diverse collection of web text exposes the model to a broad range of linguistic styles, topics, and vocabulary. The content is primarily in English.
Code: Exposing the model to code helps it learn the syntax and patterns of programming languages, improving its ability to generate code and understand code-related questions.
Mathematics: Training on mathematical text helps the model learn logical reasoning, work with symbolic representations, and address mathematical queries.