Core Concepts¶
This document explains the key concepts behind Mantis AI and how it works under the hood. Understanding these concepts will help you get the most out of the library.
Architecture Overview¶
Mantis AI is built around a simple, consistent workflow:
- Input Processing: Audio files or YouTube URLs are validated and prepared
- Audio Processing: Audio is sent to Google's Gemini AI model
- Result Formatting: The model's response is processed and returned in a clean format
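In practice, the whole pipeline is driven by a single call. A minimal sketch (the file name and URL are placeholders):

```python
import mantis

# One call runs all three stages: validate input, send to Gemini, format result
transcript = mantis.transcribe("podcast.mp3")  # local audio file
summary = mantis.summarize("https://www.youtube.com/watch?v=VIDEO_ID")  # YouTube URL
```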

Key Components¶
Audio Source Handling¶
Mantis AI supports two types of audio sources:
- Local Audio Files: Direct processing of MP3, WAV, M4A, and OGG files
- YouTube URLs: Automatic downloading and processing of YouTube audio
When you provide a YouTube URL, Mantis:
1. Validates the URL format
2. Downloads the audio using yt-dlp
3. Saves it to a temporary file
4. Processes it like a local audio file
5. Cleans up the temporary file when done
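Steps 2 and 3 can be sketched with yt-dlp's Python API. This is an illustration of the approach, not Mantis's internal code; the helper name and options are assumptions:

```python
import os
import tempfile

import yt_dlp


def download_youtube_audio(url: str) -> str:
    """Download a video's audio track to a temporary MP3 file."""
    tmp_dir = tempfile.mkdtemp()
    opts = {
        "format": "bestaudio/best",
        "outtmpl": os.path.join(tmp_dir, "audio.%(ext)s"),
        # Re-encode the downloaded stream to MP3 via ffmpeg
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([url])
    return os.path.join(tmp_dir, "audio.mp3")
```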
Gemini AI Integration¶
Mantis AI uses Google's Gemini AI models to process audio. The workflow is:
1. The audio file is read into memory
2. A prompt is created based on the task (transcribe, summarize, or extract)
3. The audio and prompt are sent to the Gemini API
4. The model processes the audio and returns a text response
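The same exchange can be reproduced directly with the google-generativeai SDK. A minimal sketch, assuming a Gemini model that accepts inline audio (the model name, file name, and prompt are illustrative, not Mantis's exact internals):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Read the audio into memory, then send it alongside a task prompt
with open("interview.mp3", "rb") as f:
    audio_bytes = f.read()

response = model.generate_content([
    "Transcribe this audio.",
    {"mime_type": "audio/mp3", "data": audio_bytes},
])
print(response.text)
```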
Clean Output Processing¶
By default, transcription removes disfluencies and speech artifacts:
- Filler words ("um", "uh", "like")
- False starts and repetitions
- Other speech artifacts
This results in clean, readable text that preserves the original meaning.
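If you need the verbatim speech instead, you can opt out via the clean_output parameter (shown under Transcribe below):

```python
import mantis

# Keep fillers, false starts, and repetitions exactly as spoken
raw = mantis.transcribe("talk.mp3", clean_output=False)
```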
Core Functions¶
Transcribe¶
The transcribe function converts audio to text:
```python
import mantis

# audio_file may be a local path (MP3, WAV, M4A, OGG) or a YouTube URL
result = mantis.transcribe(audio_file, clean_output=True)
```
Under the hood:
1. The audio file is validated
2. A prompt is created asking the model to transcribe the audio
3. If clean_output is True, the prompt includes instructions to remove disfluencies
4. The audio is sent to Gemini AI with the prompt
5. The transcription is returned as a string
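A hypothetical version of steps 2 and 3, to show how the flag changes the prompt (the function and wording are illustrative, not Mantis's actual prompt):

```python
def build_transcription_prompt(clean_output: bool) -> str:
    # Base instruction for a verbatim transcript
    prompt = "Transcribe the following audio."
    if clean_output:
        # Extra instructions that produce clean, readable text
        prompt += (
            " Remove filler words (um, uh, like), false starts, and"
            " repetitions, while preserving the original meaning."
        )
    return prompt
```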
Summarize¶
The summarize function generates a concise summary of the audio:
```python
import mantis

result = mantis.summarize(audio_file, max_length=None, language="English")
```
Under the hood:
1. The audio file is validated
2. A prompt is created asking the model to summarize the audio
3. If max_length is specified, the prompt includes a length constraint
4. The language parameter specifies the output language
5. The audio is sent to Gemini AI with the prompt
6. The summary is returned as a string
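For example, a Spanish summary with a length constraint (the file name is a placeholder):

```python
import mantis

# max_length adds a length constraint to the prompt; language sets the output language
resumen = mantis.summarize("entrevista.mp3", max_length=100, language="Spanish")
print(resumen)
```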
Extract¶
The extract function retrieves specific information based on a custom prompt:
```python
import mantis

result = mantis.extract(audio_file, prompt, structured_output=False)
```
Under the hood:
1. The audio file is validated
2. The user's prompt is combined with the audio
3. If structured_output is True, the prompt is enhanced and a JSON schema is supplied via Gemini's response_schema parameter
4. The audio is sent to Gemini AI with the prompt and schema
5. The response is validated against the schema and returned as structured data (with graceful fallbacks when parsing fails)
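A sketch of structured extraction with a custom Pydantic model (the schema classes here are hypothetical, not part of Mantis):

```python
from pydantic import BaseModel

import mantis


class ActionItem(BaseModel):
    owner: str
    task: str


class MeetingNotes(BaseModel):
    action_items: list[ActionItem]


notes = mantis.extract(
    "standup.mp3",
    prompt="List every action item and who owns it.",
    structured_output=True,
    response_schema=MeetingNotes,
)
```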
Domain-specific schemas¶
Mantis ships with reusable schemas for common tasks such as summarizing speakers, generating action items, and providing a single AudioInsightsSchema that combines key signals. You can reuse these schemas via the mantis.response_schemas.COMMON_RESPONSE_SCHEMAS registry, or pass your own Pydantic model or JSON schema using the response_schema parameter on mantis.extract. This lets teams align the structured output with internal data contracts while still benefiting from prompt engineering in the prompt argument.
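For example, reusing the bundled insights schema (the exact export location of AudioInsightsSchema is an assumption based on the module named above):

```python
import mantis
from mantis.response_schemas import AudioInsightsSchema  # assumed export location

insights = mantis.extract(
    "call.mp3",
    prompt="Extract the key insights from this call.",
    structured_output=True,
    response_schema=AudioInsightsSchema,
)
```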
Error Handling¶
Mantis AI includes comprehensive error handling:
- Input Validation: Ensures audio files and parameters are valid
- Network Errors: Handles API connection issues
- Processing Errors: Manages issues during audio processing
- Cleanup: Ensures temporary files are removed even if errors occur
All errors are wrapped in specific exception types that inherit from MantisError, making it easy to catch and handle different error scenarios.
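For example (the import path for MantisError is an assumption):

```python
import mantis
from mantis import MantisError  # assumed export location

try:
    text = mantis.transcribe("missing.mp3")
except MantisError as exc:
    # All library failures inherit from MantisError, so one handler suffices
    print(f"Transcription failed: {exc}")
```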
Logging¶
Mantis AI uses a silent-by-default approach to logging:
- By default, logging is disabled for clean output
- enable_verbose_logging(): Enables informational logging
- enable_debug_logging(): Enables detailed debug logging
- enable_warning_logging(): Enables only warnings and errors
This allows you to control the verbosity of the library based on your needs.
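For example (assuming these helpers are exported at the package's top level):

```python
import mantis

mantis.enable_verbose_logging()  # opt in to informational logging
transcript = mantis.transcribe("talk.mp3")
```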
Next Steps¶
Now that you understand how Mantis AI works, check out:
- API Reference for detailed documentation of all functions and parameters
- Advanced Usage for more complex scenarios and customization