DataGemma

DataGemma is a research tool that lets users ask questions in plain language and receive answers grounded in publicly available statistical data from the Data Commons repository. The tool combines specially built variants of Gemma 2, the Gemini API with Gemini 1.5 Pro, and a set of libraries designed to work with Data Commons.

This research tool provides two separate techniques for answering questions based on Data Commons statistical data:

  • Retrieval-Interleaved Generation (RIG) - This approach uses a variant of Gemma 2 that is fine-tuned to recognize when it needs to replace a generated number with more accurate information from Data Commons. For more details, see the Colab notebook and models on Kaggle or Hugging Face.
  • Retrieval-Augmented Generation (RAG) - This approach uses a variant of Gemma 2 that retrieves relevant information from Data Commons and then uses that information to create an extended prompt for the Gemini 1.5 Pro model. For more details, see the Colab notebook and models on Kaggle or Hugging Face.
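The difference between the two techniques can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the DataGemma implementation: the `[DC("query") -> "guess"]` marker format, the `lookup` helper, and the in-memory `FAKE_STORE` are hypothetical stand-ins for the fine-tuned model's output and for real Data Commons queries. In the RAG half, the list of queries is passed in by hand, whereas in DataGemma the Gemma 2 variant is what produces the retrieval step.

```python
import re

# Hypothetical stand-in for Data Commons: maps a plain-language query
# to a (value, source) pair. A real system would query Data Commons.
FAKE_STORE = {
    "population of exampleland": ("1,234,567", "Example Statistics Office"),
}


def lookup(query):
    """Return (value, source) for a query, or None if nothing is found."""
    return FAKE_STORE.get(query.strip().lower())


# Hypothetical marker a RIG-style model might emit next to a number it
# generated: [DC("query") -> "model guess"]
MARKER = re.compile(r'\[DC\("([^"]+)"\)\s*->\s*"([^"]*)"\]')


def rig_fill(annotated_text):
    """RIG-style step: replace each generated number with retrieved data."""
    def substitute(match):
        query, guess = match.group(1), match.group(2)
        hit = lookup(query)
        if hit is None:
            # No data found: keep the model's own number.
            return guess
        value, source = hit
        return f"{value} (source: {source})"

    return MARKER.sub(substitute, annotated_text)


def rag_prompt(question, queries):
    """RAG-style step: retrieve first, then build an extended prompt."""
    facts = []
    for q in queries:
        hit = lookup(q)
        if hit is not None:
            value, source = hit
            facts.append(f"- {q}: {value} (source: {source})")
    context = "\n".join(facts) if facts else "- (no data found)"
    return f"Use only these statistics:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    draft = 'Exampleland has [DC("population of Exampleland") -> "1.2 million"] people.'
    print(rig_fill(draft))
    print(rag_prompt("How many people live in Exampleland?",
                     ["population of Exampleland"]))
```

In this sketch, the RIG path corrects numbers after the text has been generated, while the RAG path gathers statistics before generation and hands them to a second model (Gemini 1.5 Pro in DataGemma) as part of a longer prompt.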

For more research and technical details on DataGemma, see the DataGemma technical paper.

  • Apply generative artificial intelligence (AI) to a vast repository of public statistical data to explore and uncover new insights.
  • Investigate ways to guide generative AI model output with retrieval-augmented and retrieval-interleaved techniques.

Learn more

View more code, notebooks, information, and discussions about the DataGemma RIG model on Kaggle.
Try DataGemma using the retrieval-interleaved technique to answer questions.
Try DataGemma using the retrieval-augmented technique to answer questions.