Build an AI content search with Docs Agent

Searching for information is one of the most common uses of artificial intelligence (AI) language models. Building a conversational search interface for your content using AI allows your users to ask specific questions and get direct answers.

This tutorial shows you how to build an AI-powered, conversational search interface for your content. It's based on Docs Agent, an open source project that uses the Google Gemini API to create a conversational search interface, without training a new AI model or doing model tuning with Gemini models. That means you can get this search capability built quickly and use it for both small and large content sets.

For a video overview of the project and how to extend it, including insights from the folks who build it, check out: AI Content Search | Build with Google AI. Otherwise, you can get started extending the project by following the instructions below.

Overview

The Docs Agent project provides a conversational search interface for a specific content set, backed by the Google Gemini API and language models. Users can ask a detailed question in a conversational style and get a detailed answer based on a specific content set. Behind the scenes, the Docs Agent takes the question and searches against a vector database of the content, and creates a detailed prompt for the language model, including snippets of relevant text. The large language model generates a response to the question and the Docs Agent formats the response and presents it to the user.

Figure 1. Functional diagram of Docs Agent project app.

The key to making Docs Agent able to answer questions about your content is the creation of a vector database of that content. You separate your content into logical chunks of text and generate a vector for each of them. These vectors are numeric representations of the information in each chunk and are generated with an AI text embedding function from Google's language models.

When a user asks a question, the Docs Agent uses the same text embedding function to create a numeric representation of that question, and uses that value to search the vector database and find related content. It takes the top results and adds that information to a prompt for the language model. The AI model takes the question and the additional context information and generates an answer.
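
The last step of that flow, combining the retrieved text with the user's question, can be sketched as follows. This is a minimal illustration and not the Docs Agent implementation: the function name `build_prompt`, the `max_context_chars` budget, and the prompt wording are all assumptions made for this example.

```python
def build_prompt(question: str, snippets: list[str], max_context_chars: int = 4000) -> str:
    """Combine retrieved snippets and the user's question into one prompt.

    `snippets` is assumed to be ordered from most to least relevant,
    as returned by a vector database search.
    """
    context_parts = []
    used = 0
    for snippet in snippets:
        # Stop adding context once the (hypothetical) size budget is reached.
        if used + len(snippet) > max_context_chars:
            break
        context_parts.append(snippet.strip())
        used += len(snippet)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("What is a widget?", ["A widget describes part of the UI."]))
```

The resulting string is what would be sent to the language model, so the answer is grounded in your content rather than in the model's general knowledge alone.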

Project setup

These instructions walk you through getting the Docs Agent project set up for development and testing. The general steps are installing some prerequisite software, setting a few environment variables, cloning the project from the code repository, and running the Poetry installation to configure the project. The code project uses Python Poetry to manage packages and the Python runtime environment.

Install the prerequisites

The Docs Agent project uses Python 3 and Python Poetry to manage packages and run the application. The following installation instructions are for a Linux host machine.

To install the required software:

  1. Install git, pip, and the venv virtual environment package for Python 3.
    sudo apt update
    sudo apt install git pip python3-venv
    
  2. Install Python Poetry to manage dependencies and packaging for the project.
    curl -sSL https://install.python-poetry.org | python3 -
    

You can use Python Poetry to add more Python libraries if you extend the project.

Set environment variables

Set a few environment variables that are required for the Docs Agent code project to run, including a Google Gemini API Key and a Python Poetry setting. If you are using Linux, you may want to add these variables to your $HOME/.bashrc file to make them default settings for your terminal sessions.

To set the environment variables:

  1. Get a Google Gemini API Key and copy the key string.
  2. Set the API Key as an environment variable. On Linux hosts, use the following command.
    export API_KEY=<YOUR_API_KEY_HERE>
    
  3. Resolve a known issue for Python Poetry by setting the PYTHON_KEYRING_BACKEND parameter. On Linux hosts, use the following command.
    export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
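
If you extend the project with your own Python scripts, it can help to fail early when the API key is missing. The following is a minimal sketch under that assumption; `read_api_key` is a hypothetical helper written for this example and is not part of Docs Agent. Only the `API_KEY` variable name comes from the steps above.

```python
import os


def read_api_key(env=os.environ) -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = env.get("API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "API_KEY is not set; export it before running Docs Agent scripts."
        )
    return key


if __name__ == "__main__":
    try:
        read_api_key()
        print("API_KEY is set.")
    except RuntimeError as err:
        print(err)
```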
    

Clone and configure the project

Download the project code and use the Poetry installation command to download the required dependencies and configure the project. You need git source control software to retrieve the project source code.

To download and configure the project code:

  1. Clone the git repository using the following command.
    git clone https://github.com/google/generative-ai-docs
    
  2. Optionally, configure your local git repository to use sparse checkout, so you have only the files for the Docs Agent project.
    cd generative-ai-docs/
    git sparse-checkout init --cone
    git sparse-checkout set demos/palm/python/docs-agent/
    
  3. Move to the docs-agent project root directory.
    cd demos/palm/python/docs-agent/
    
  4. Run the Poetry install command to download dependencies and configure the project:
    poetry install
    

Prepare content

The Docs Agent project is designed to work with text content, and it includes tools specifically to work with websites that use Markdown as the source format. If you are working with website content, you should preserve (or replicate) the directory structure of the served website to enable the content processing task to map and create links to that content.

Depending on the format and details of your content, you may need to clean your content to remove non-public information, internal notes, or other information that you do not want to be searchable. You should retain basic formatting such as titles and headings, which help create logical text splits, or chunks, in the content processing step.

To prepare content for processing:

  1. Create a directory for the content you want the AI agent to search.
    mkdir docs-agent/content/
    
  2. Copy your content into the docs-agent/content/ directory. If the content is a website, preserve (or replicate) the directory structure of the served website.
  3. Clean or edit the content as needed to remove non-public information, or other information you don't want included in the searches.

Use Flutter docs for testing

If you need a set of content for testing Docs Agent, you can use the Flutter developer docs for testing.

To get the Flutter developer docs:

  1. Move to the content directory for the content you want the AI agent to search.
    cd docs-agent/content/
    
  2. Clone the Flutter docs into the docs-agent/content/ directory.
    git clone --recurse-submodules https://github.com/flutter/website.git
    

Process content

In order for the search agent to effectively search for content related to users' questions, you need to build a database of vectors that represent your content. The vectors are generated using an AI language model function called text embedding. Text embeddings are numeric representations of the text content. They approximate the semantic meaning of the text as a set of numbers. Having numeric representations of information allows the system to take a user's question, approximate its meaning using the same text embedding function, and then find related information as a mathematical calculation, using a k-nearest neighbors (k-NN) algorithm.
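
To make the nearest-neighbor idea concrete, here is a small sketch that ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors and chunk names are made up for illustration; real text embeddings have many more dimensions, and the project relies on its vector database to perform this search rather than code like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def k_nearest(query_vec, stored, k=2):
    """Return the ids of the k stored vectors most similar to the query.

    `stored` is a list of (chunk_id, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in stored]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors standing in for text embeddings of three content chunks.
stored = [
    ("layout", [0.9, 0.1, 0.0]),
    ("animation", [0.1, 0.9, 0.2]),
    ("testing", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.3, 0.0]  # toy embedding of a user's question

print(k_nearest(query, stored, k=2))  # prints ['layout', 'animation']
```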

Split text content

The amount of text that a text embedding vector can effectively represent is limited. This project limits the text represented in a vector to 3000 characters or fewer, which means you have to split your content into chunks under that character limit. This section describes how to use a script provided with the Docs Agent project to split Markdown files into smaller text chunks. For tips on working with other content formats, see Other content formats.

To split Markdown format content:

  1. Configure the input parameters for the processing script by editing the docs-agent/config.yaml file. This example targets a subset of the Flutter docs:
    input:
    - path: "content/website/src/ui"
      url_prefix: "https://docs.flutter.dev/ui"
    
  2. Save your changes to this configuration file.
  3. Split the Markdown source content by running the markdown_to_plain_text.py script:
    poetry run python3 scripts/markdown_to_plain_text.py
    

The script processes the input content and creates output text files in the docs-agent/data directory, splitting the text based on titles, headings, and related paragraphs. Processing may take some time depending on the size of your content.
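
The general idea of heading-based splitting under a character limit can be sketched as follows. This is a simplified illustration, not the logic of the project's script: the function name `split_markdown` is hypothetical, and the approach (split at headings, then at paragraph breaks, then by hard slicing) is an assumption chosen to keep the example short.

```python
import re

MAX_CHARS = 3000  # character limit per chunk, as stated in this tutorial


def split_markdown(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split Markdown text into chunks of at most max_chars characters."""
    # Start a new section at every Markdown heading line (#, ##, ...).
    sections = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        current = ""
        for para in section.split("\n\n"):
            # Hard-slice any single paragraph that is itself over the limit.
            for start in range(0, len(para), max_chars):
                piece = para[start:start + max_chars]
                candidate = f"{current}\n\n{piece}" if current else piece
                if len(candidate) <= max_chars:
                    current = candidate
                else:
                    chunks.append(current)
                    current = piece
        if current:
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = "# Title\n\nIntro text.\n\n## Section\n\n" + "word " * 1000
    for chunk in split_markdown(doc):
        print(len(chunk), repr(chunk[:30]))
```

Keeping a heading together with the paragraphs that follow it tends to produce chunks that are meaningful on their own, which is why headings are used as the first split point here.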

Create text embedding vectors

After splitting your content into appropriately sized, meaningful chunks, you can populate the vector database with your content using a text embedding function. The Docs Agent project uses the Chroma vector database to store text embedding vectors. These instructions cover how to use the Docs Agent script to populate a vector database with your split content.

To generate text embeddings and populate the vector database:

  1. Navigate to the docs-agent project directory:
    cd docs-agent/
    
  2. Populate the vector database with your content using the populate_vector_database.py script:
    poetry run python3 scripts/populate_vector_database.py
    

This script uses the Google Gemini API to generate text embedding vectors and then saves the output to the vector database. Processing may take some time depending on the size of your content.

Other content formats

The Docs Agent project is designed to work with website content in Markdown format. You can use other content formats with the project; however, support for those formats needs to be built by you or other members of the community. Check the code repository Issues and Pull Requests for folks building similar solutions.

The key thing you need to build to support other content formats is a splitter script like the scripts/markdown_to_plain_text.py script. Aim to build a script or program that creates similar output to this script. Remember that the final text output should have minimal formatting and extraneous information. If you are using content formats such as HTML or JSON, make sure you strip away as much of the non-informational formatting (tags, scripting, CSS) as possible, so that it does not skew the values of the text embeddings you generate from them.
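
As one example of stripping non-informational formatting, the sketch below extracts visible text from HTML using only the Python standard library. It is a starting point under simple assumptions, not a complete splitter: the class and function names are hypothetical, the list of block-level tags is a small illustrative subset, and the output would still need to be split into chunks.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style content."""

    SKIP_TAGS = {"script", "style"}
    # Illustrative subset of tags that should start a new line of text.
    BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<h1>Title</h1><style>p{color:red}</style><p>Hello <b>world</b>.</p>"
    print(html_to_text(sample))
```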

Once you have built a splitter script for your content format, you should be able to run the populate_vector_database.py script to populate your vector database.

Test the app

When you have completed populating your vector database, the project is ready for testing. The project provides a packaging function to let you run the project locally.

To run and test the project from the command line:

  1. Navigate to the docs-agent project directory:
    cd docs-agent/
    
  2. Run the test script:
    poetry run python3 scripts/test_vector_database.py
    

To run and test the project web interface:

  1. Navigate to the docs-agent project directory:
    cd docs-agent/
    
  2. Run the web application launch script:
    poetry run ./chatbot/launch.sh -p 5000
    
  3. Using your web browser, navigate to the URL shown in the output of the launch script and test the application.
    * Running on http://your-hostname-here:5000
    

Additional resources

For more information about the Docs Agent project, see the code repository. If you need help building the application or are looking for developer collaborators, check out the Google Developers Community Discord server.

Production applications

If you plan to deploy Docs Agent for a large audience, note that your use of the Google Gemini API may be subject to rate limiting and other use restrictions. If you are considering building a production application like Docs Agent with the Gemini API, check out Google Cloud Vertex AI services for increased scalability and reliability of your app.