Pomo
Anthropomorphize anything - speak with your pets, paintings, and more
What it does
We were inspired by the Google Project Astra product demo but wanted to change the system prompt to try new and fun use cases. Unfortunately, Astra has not been released yet, and it is unlikely to offer an API that allows system prompt manipulation, so we set out to create our own open-source version.
For a starting use case, we chain a series of AI models to let users interact with their surroundings in new and entertaining ways. Specifically, they can anthropomorphize anything: their pet dog or cat, a painting on the wall, or the coffee they are drinking. The user clicks on an object, and we create a mask over it using a TensorFlow-based model (MediaPipe's Interactive Segmenter; see https://ai.google.dev/edge/mediapipe/solutions/vision/interactive_segmenter). We send the cutout of the object along with the background as two images to Gemini Flash, which identifies the object. We then start a new Gemini streaming chat whose system prompt informs Gemini of its new role (e.g., the painting on the wall), and the user can converse with this newly anthropomorphized object or animal.
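The role-assignment step can be sketched as a small prompt-building function. This is an illustrative assumption, not the project's actual code: the function name, prompt wording, and example label are hypothetical, standing in for the system prompt handed to the new Gemini chat session once Gemini Flash has identified the clicked object.

```typescript
// Hypothetical sketch: build the system prompt that casts Gemini as the
// identified object. The prompt text here is illustrative only.
function buildRolePrompt(objectLabel: string): string {
  return [
    `You are ${objectLabel}.`,
    "Stay fully in character: speak in the first person as this object or animal,",
    "with a personality that fits what you are and where you are.",
    "Keep replies short and conversational; they will be read aloud.",
  ].join(" ");
}

// Example: this string would be set as the system instruction of a new chat.
const prompt = buildRolePrompt("the painting of a lighthouse on the wall");
```

The point of a separate identification pass is that the role prompt can be generated automatically for any clicked object, with no per-object configuration.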
We use voice activity detection (VAD) to identify when the user is speaking and, after 1.3 seconds of silence, we send the latest frame captured from their camera along with the audio to Gemini to continue the conversation. When Gemini responds, the text is converted to speech using ElevenLabs' streaming text-to-speech API. Between the Google image segmenter, optical flow, Gemini Flash (used twice), VAD, and text-to-speech, our pipeline uses six AI models.
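The end-of-utterance logic can be sketched as a tiny state machine; this is our assumption of how such a check would work, not the project's actual implementation. Each VAD frame is flagged as speech or silence, and once 1.3 seconds of continuous silence follow speech, the turn fires exactly once:

```typescript
// Minimal sketch (hypothetical, not the actual implementation) of the
// 1.3-second silence threshold that ends a user's turn.
const SILENCE_MS = 1300;

class TurnDetector {
  private lastSpeechAt: number | null = null;
  private sent = false;

  // Called once per VAD frame; returns true exactly once per utterance,
  // when the silence threshold is crossed. The caller would then grab the
  // latest camera frame and recorded audio and send them to Gemini.
  onFrame(isSpeech: boolean, nowMs: number): boolean {
    if (isSpeech) {
      this.lastSpeechAt = nowMs; // still talking: reset the silence timer
      this.sent = false;
      return false;
    }
    if (this.lastSpeechAt !== null && !this.sent && nowMs - this.lastSpeechAt >= SILENCE_MS) {
      this.sent = true; // fire once, then stay quiet until speech resumes
      return true;
    }
    return false;
  }
}

// Simulated frames: speech at t=0 ms, then silence at 700 ms and 1400 ms.
const det = new TurnDetector();
const fired = [det.onFrame(true, 0), det.onFrame(false, 700), det.onFrame(false, 1400)];
```

Debouncing on a fixed silence window keeps the pipeline from cutting the user off mid-sentence while still responding promptly once they stop.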
- Sam & Tim
Built with
- Web/Chrome
- MediaPipe Interactive Segmenter by Google
Team
By
Pomo
From
United States