Telegram Flow - Text to Animated Face Video

Project Overview

Input: French text message received via Telegram channel
Processing: Text-to-speech generation, lip synchronization, video composition
Output: Video file with animated presenter speaking the input text, returned via Telegram
Technology Stack: Telegram Bot API, N8N workflow automation, TTS Chatterbox, Wav2Lip, FFmpeg

📋 Project Status: Planning and documentation phase. This document outlines the complete workflow breakdown and technical requirements. Implementation will follow this specification.

System Flow Diagram

User sends French text message to Telegram channel

Telegram webhook triggers N8N flow

N8N extracts text and validates input (French language check)

N8N calls TTS Chatterbox API with text input

TTS Chatterbox generates speech audio using voice profile from scripts\utils\test_lips\girlstorytelling.mp3

N8N receives generated audio file

N8N triggers lip synchronization process (Wav2Lip)

Lip sync generates video with animated presenter face

N8N composes final video (background + lip-synced presenter)

N8N sends video file back to user via Telegram

Step 1: Telegram Integration Setup

Create Telegram bot using BotFather
Obtain bot token and configure webhook
Set up Telegram channel or group for receiving messages
Configure N8N Telegram trigger node to listen for messages
Implement message validation (check for text, validate French language)
Extract user information (chat_id) for response delivery

📋 Complete Bot Configuration: A comprehensive guide with all bot elements (name, description, commands, messages, error handling) is available in telegram-bot-config.md. This includes ready-to-use French messages, command descriptions, and all text content needed for BotFather setup.

# Telegram Bot Configuration
# 1. Create bot via @BotFather on Telegram
# 2. Bot token: <YOUR_BOT_TOKEN> (keep the real value in N8N credentials only; never paste it into documentation)
# 3. Configure webhook URL in N8N Telegram node
# 4. Set up message filter for text messages only

# N8N Telegram Trigger Node Configuration:
# - Credentials: Telegram Bot Token (<YOUR_BOT_TOKEN>, stored as an N8N credential)
# - Update Type: "message" (posts made in a channel arrive as "channel_post" instead)
# - Additional Fields: "text" (to extract message content)
# - Chat ID: Store for response delivery

Telegram Setup: The Telegram bot is the entry point for user requests: messages sent to the configured channel trigger the N8N workflow. Store the bot token only in the N8N credentials vault, and revoke and reissue it through BotFather if it is ever exposed. See telegram-bot-config.md for the complete bot configuration (name, description, commands, and all French messages).
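As an illustration of keeping the token out of documents and scripts, the sketch below reads it from an environment variable and builds Bot API URLs from it. The variable name TELEGRAM_BOT_TOKEN and the helper names are assumptions of this sketch, not part of the project; inside N8N the credential store plays this role.

```python
import os

API_BASE = "https://api.telegram.org"


def load_bot_token(env_var: str = "TELEGRAM_BOT_TOKEN") -> str:
    """Read the bot token from the environment (variable name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set")
    return token


def api_url(token: str, method: str) -> str:
    """Build a Bot API URL such as https://api.telegram.org/bot<token>/getMe."""
    return f"{API_BASE}/bot{token}/{method}"


def mask_token(token: str) -> str:
    """Return a log-safe form of the token: the numeric bot id, secret part hidden."""
    bot_id = token.split(":", 1)[0]
    return f"{bot_id}:***"
```

Logging mask_token(token) instead of the raw value keeps the secret out of log files while still identifying which bot was used.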

Step 2: N8N Workflow Design

Create new N8N workflow for Telegram integration
Add Telegram trigger node (listens for incoming messages)
Add function node for text validation and language detection
Add HTTP request node for TTS Chatterbox API calls
Add execute command node for lip sync processing
Add execute command node for video composition (FFmpeg)
Add Telegram send message node for status updates
Add Telegram send video node for final video delivery
Implement error handling and retry logic
Add queue management for concurrent requests

# N8N Workflow Structure:
# 
# 1. Telegram Trigger Node
#    - Receives message from Telegram channel
#    - Extracts: message.text, message.chat.id, message.from.id
#
# 2. Function Node: Validate Input
#    - Check if message.text exists
#    - Validate French language (optional: use language detection API)
#    - Return error message if validation fails
#
# 3. HTTP Request Node: TTS Chatterbox
#    - Method: POST
#    - URL: http://localhost:PORT/api/tts (TTS Chatterbox endpoint)
#    - Body: { "text": "{{ $json.message.text }}", "voice": "girlstorytelling" }
#    - Note: the message text must be JSON-escaped; raw quotes or newlines would break the body
#    - Response: Audio file (MP3/WAV)
#
# 4. Execute Command Node: Lip Sync Processing
#    - Command: PowerShell script to run Wav2Lip
#    - Input: Audio file from TTS, Presenter face image/video
#    - Output: Lip-synced face video
#
# 5. Execute Command Node: Video Composition
#    - Command: FFmpeg to overlay face on background
#    - Input: Background image, Lip-synced face video
#    - Output: Final composite video
#
# 6. Telegram Send Video Node
#    - Chat ID: {{ $json.message.chat.id }}
#    - Video: Final composite video file
#    - Caption: "Voici votre vidéo générée!"

Workflow Design: The N8N workflow orchestrates the entire process from message reception to video delivery. Each step should include error handling and logging for debugging purposes.
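The validation step can be sketched as follows. This is a minimal illustration, not the project's implementation: the stop-word list and the 20% threshold are assumptions chosen for the example, and a dedicated language-detection service would be more reliable than this heuristic.

```python
import re
from typing import Optional, Tuple

# Small French stop-word list used as a crude heuristic (assumption of this sketch).
FRENCH_HINTS = {
    "le", "la", "les", "un", "une", "des", "du", "de", "et", "est", "en",
    "je", "tu", "il", "elle", "nous", "vous", "que", "qui", "pour", "dans",
    "avec", "sur", "pas", "ce", "bonjour", "merci", "votre", "voici",
}
ERROR_NOT_FRENCH = "Veuillez envoyer un texte en français."


def looks_french(text: str, threshold: float = 0.2) -> bool:
    """Return True if enough words belong to the French hint list."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in FRENCH_HINTS)
    return hits / len(words) >= threshold


def validate_update(update: dict) -> Tuple[Optional[dict], Optional[str]]:
    """Extract chat id and text from a Telegram update.

    Returns (payload, None) on success or (None, error_message) on failure.
    """
    message = update.get("message") or update.get("channel_post") or {}
    chat_id = (message.get("chat") or {}).get("id")
    if chat_id is None:
        return None, "missing chat id"
    text = (message.get("text") or "").strip()
    if not text or not looks_french(text):
        return None, ERROR_NOT_FRENCH
    return {"chat_id": chat_id, "text": text}, None
```

Returning the error text alongside the payload lets the workflow branch on a single value and send the French message straight back to the chat.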

Step 3: TTS Chatterbox Integration

Configure TTS Chatterbox server connection in N8N
Set up voice profile using reference audio: scripts\utils\test_lips\girlstorytelling.mp3
Create API endpoint wrapper if needed (or use direct HTTP requests)
Implement text-to-speech conversion with French language support
Handle audio file generation and temporary storage
Validate audio output quality and duration

# TTS Chatterbox API Integration
# 
# Endpoint: http://localhost:PORT/api/tts
# Method: POST
# Content-Type: application/json
#
# Request Body:
# {
#   "text": "Bonjour, voici votre message en français.",
#   "voice": "girlstorytelling",
#   "language": "fr",
#   "output_format": "mp3",
#   "sample_rate": 16000
# }
#
# Response:
# - Audio file (binary) or file path
# - Duration information
# - File metadata
#
# Voice Profile Setup:
# - Reference audio: scripts\utils\test_lips\girlstorytelling.mp3
# - Voice name: "girlstorytelling" or custom name
# - Language: French (fr)
# - Voice characteristics: Female, storytelling style

TTS Configuration: The TTS Chatterbox server is already installed on the machine. The voice profile should be configured to match the characteristics of the reference audio file (girlstorytelling.mp3). Ensure French language support is enabled.

⚠️ Important: The TTS Chatterbox server must be running and accessible before the N8N workflow can process requests. Implement health check logic in N8N to verify server availability.
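A health check and request builder along these lines could back that logic. The sketch uses only the Python standard library; the endpoint URL is left to the caller because the port is not fixed in this document, and the request fields simply mirror the request body listed above, which is itself still a planned interface.

```python
import json
import urllib.error
import urllib.request


def build_tts_request(url: str, text: str, voice: str = "girlstorytelling") -> urllib.request.Request:
    """Build the POST request; json.dumps handles quotes and newlines in the text."""
    payload = {
        "text": text,
        "voice": voice,
        "language": "fr",
        "output_format": "mp3",
        "sample_rate": 16000,
    }
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_server_up(url: str, timeout: float = 3.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the server answers at all with a non-5xx status.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # The server responded, even if with 4xx; treat only 5xx as down.
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        return False
```

Treating a 4xx answer as "up" is a design choice of this sketch: it shows the process is reachable even when the probed path is not a valid route.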

Step 4: Lip Synchronization Process

Prepare presenter face image/video for lip sync
Convert TTS audio to appropriate format (WAV, 16kHz, mono) if needed
Execute Wav2Lip processing using conda environment
Generate lip-synced face video
Validate output video quality and synchronization accuracy
Handle processing errors and timeouts

# Lip Sync Processing Script
# 
# Input Files:
# - Audio: TTS-generated audio (from Step 3)
# - Face: Presenter face image/video (e.g., docs_website\assets\images\TV_presenter_FaceZoom.jpg)
#
# Processing Command:
powershell -ExecutionPolicy Bypass -File scripts\utils\run_lip_sync.ps1 python third_party\Wav2Lip\inference.py `
  --checkpoint_path models\wav2lip\wav2lip.pth `
  --face "docs_website\assets\images\TV_presenter_FaceZoom.jpg" `
  --audio "{{ $json.tts_audio_path }}" `
  --outfile "{{ $json.temp_output_path }}\lip_synced_face.mp4" `
  --pads 0 20 0 0 `
  --face_det_batch_size 16 `
  --wav2lip_batch_size 128
#
# Output:
# - Lip-synced face video (274x276 resolution)
# - Duration matches audio length
# - Synchronized lip movements
#
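# Note: inference.py accepts a static image for --face (as in the command above) and
# animates it at its default 25 fps. To supply a looped video instead, convert the image first:
# ffmpeg -loop 1 -i "face.jpg" -t AUDIO_DURATION -vf "fps=25" -pix_fmt yuv420p "face_video.mp4"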

Lip Sync Requirements: Wav2Lip processing requires the conda lip_sync environment. Processing time grows with video duration (approximately 30 seconds for 20 seconds of video). Consider asynchronous processing or a queue system to handle multiple concurrent requests.

⚠️ Performance Consideration: Because lip synchronization takes 30 seconds or more even for short videos, notify users that processing is in progress, for example with a "Processing..." status message sent via Telegram as soon as the request is accepted.
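One way to drive the lip sync step from a script is sketched below: it assembles the same command line shown above as an argument list and runs it with a timeout. The function names are hypothetical, and the 300-second default is borrowed from the timeout proposed in the error handling step.

```python
import subprocess
from typing import List, Sequence


def build_wav2lip_command(face: str, audio: str, outfile: str,
                          pads: Sequence[int] = (0, 20, 0, 0)) -> List[str]:
    """Assemble the PowerShell + Wav2Lip invocation as an argument list."""
    return [
        "powershell", "-ExecutionPolicy", "Bypass",
        "-File", r"scripts\utils\run_lip_sync.ps1",
        "python", r"third_party\Wav2Lip\inference.py",
        "--checkpoint_path", r"models\wav2lip\wav2lip.pth",
        "--face", face,
        "--audio", audio,
        "--outfile", outfile,
        "--pads", *[str(p) for p in pads],
        "--face_det_batch_size", "16",
        "--wav2lip_batch_size", "128",
    ]


def run_with_timeout(command: Sequence[str], timeout_s: float = 300.0) -> bool:
    """Run a command; return True on exit code 0, False on error or timeout."""
    try:
        result = subprocess.run(list(command), capture_output=True, text=True,
                                timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # Timeout, or the executable could not be started at all.
        return False
    return result.returncode == 0
```

Passing arguments as a list avoids shell quoting problems when paths contain spaces.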

Step 5: Video Composition

Load background image/video (TV studio background)
Get duration of lip-synced face video
Convert background image to video matching face video duration
Overlay lip-synced face video onto background
Position face video correctly on background
Generate final composite video
Optimize video for Telegram delivery (file size, format)

# Video Composition with FFmpeg
#
# Step 1: Get audio/face video duration
$faceVideoDuration = ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "lip_synced_face.mp4"
#
# Step 2: Convert background image to video
ffmpeg -loop 1 -i "docs_website\assets\images\TV_background03_HD_with_logo.jpg" `
  -t $faceVideoDuration `
  -vf "fps=25" `
  -pix_fmt yuv420p `
  "temp_background.mp4"
#
# Step 3: Overlay face video on background (audio is taken from the face video)
ffmpeg -i "temp_background.mp4" `
  -i "lip_synced_face.mp4" `
  -filter_complex "[0:v][1:v]overlay=x=423:y=170[v]" `
  -map "[v]" -map 1:a `
  -c:v libx264 `
  -pix_fmt yuv420p `
  -c:a copy `
  "final_composite_video.mp4"
#
# Step 4: Optimize for Telegram (optional - reduce file size)
ffmpeg -i "final_composite_video.mp4" `
  -c:v libx264 `
  -preset medium `
  -crf 23 `
  -maxrate 2M `
  -bufsize 4M `
  -c:a aac `
  -b:a 128k `
  "final_video_optimized.mp4"
#
# Output Specifications:
# - Resolution: 1920x1080 (1080p)
# - Frame rate: 25 fps
# - Codec: H.264/AVC
# - Container: MP4
# - Audio: AAC, 48kHz (or match TTS audio)

Video Composition: The final video combines the TV studio background with the lip-synced presenter face. Positioning coordinates (x=423, y=170) should match the presenter's location in the background image. Consider implementing automatic face detection for dynamic positioning.
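The composition step can also be prepared programmatically. The sketch below parses the ffprobe duration output, checks that the overlay stays inside the background frame, and builds the overlay command as an argument list. The helper names are hypothetical, and the explicit stream mapping (video from the filter, audio from the face video if present) is a choice of this sketch.

```python
from typing import List


def parse_duration(ffprobe_output: str) -> float:
    """Convert ffprobe's plain duration output (e.g. '20.48') to seconds."""
    duration = float(ffprobe_output.strip())
    if duration <= 0:
        raise ValueError(f"non-positive duration: {duration}")
    return duration


def overlay_fits(bg_w: int, bg_h: int, face_w: int, face_h: int, x: int, y: int) -> bool:
    """Check that the face video lies fully inside the background frame."""
    return x >= 0 and y >= 0 and x + face_w <= bg_w and y + face_h <= bg_h


def build_overlay_command(background_video: str, face_video: str, output: str,
                          x: int = 423, y: int = 170) -> List[str]:
    """Build the FFmpeg overlay command with explicit stream mapping."""
    return [
        "ffmpeg", "-y",
        "-i", background_video,
        "-i", face_video,
        "-filter_complex", f"[0:v][1:v]overlay=x={x}:y={y}[v]",
        "-map", "[v]",      # composed video from the filter graph
        "-map", "1:a?",     # audio from the face video, if it has any
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "-c:a", "copy",
        output,
    ]
```

With the sizes given in this document (274x276 face on a 1920x1080 background), overlay_fits confirms that x=423, y=170 keeps the face inside the frame.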

Step 6: Telegram Response Delivery

Send processing status message to user (optional)
Upload final video file to Telegram
Send video message with caption to user
Handle file size limitations (the Telegram Bot API limits bot uploads to 50 MB)
Implement error handling for failed uploads
Clean up temporary files after successful delivery

# Telegram Video Delivery
#
# N8N Telegram Send Video Node Configuration:
# - Chat ID: {{ $json.message.chat.id }}
# - Video File: {{ $json.final_video_path }}
# - Caption: "Voici votre vidéo générée! 🎬"
# - Format: MP4 (Telegram may deliver other formats as a document rather than a playable video)
# - Maximum file size: 50MB
#
# Error Handling:
# - If file > 50MB: Compress video further or split into parts
# - If upload fails: Retry with exponential backoff
# - Send error message to user if delivery fails after retries
#
# Cleanup:
# - Delete temporary audio files
# - Delete temporary video files
# - Keep final video for X hours (optional: for user re-download)
# - Log all operations for debugging

Delivery Process: The final video is sent back to the user via Telegram. Ensure proper error handling for network issues, file size limitations, and Telegram API rate limits. Consider implementing a status update system to keep users informed during processing.
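To handle the size limit, the workflow needs to know whether a file is too large and what bitrate would bring it under the limit. The following sketch shows the arithmetic; the 5% safety margin and the function names are assumptions of this example, and the limit is taken as 50 x 1024 x 1024 bytes.

```python
import os

TELEGRAM_MAX_BYTES = 50 * 1024 * 1024  # 50 MB upload limit for bots


def needs_compression(path: str, limit: int = TELEGRAM_MAX_BYTES) -> bool:
    """Return True if the file on disk exceeds the upload limit."""
    return os.path.getsize(path) > limit


def target_video_kbps(duration_s: float, audio_kbps: int = 128,
                      limit: int = TELEGRAM_MAX_BYTES, margin: float = 0.95) -> int:
    """Video bitrate (kbit/s) that keeps the whole file under the limit.

    total bits available = limit * 8 * margin; subtract the audio share.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    total_kbps = (limit * 8 * margin) / duration_s / 1000
    return max(int(total_kbps - audio_kbps), 0)
```

The returned value can be passed to FFmpeg as the video bitrate (for example `-b:v 536k` for a 10-minute video) when re-encoding an oversized file.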

Step 7: Error Handling & Queue Management

Implement request queue system for concurrent users
Add error handling for each workflow step
Create retry logic for failed operations
Implement timeout handling for long-running processes
Add logging and monitoring for debugging
Create user notification system for errors
Implement rate limiting to prevent abuse

# Error Handling Strategy
#
# Queue Management:
# - Use N8N queue system or external queue (Redis, RabbitMQ)
# - Limit concurrent processing (e.g., max 2-3 simultaneous requests)
# - Implement priority queue if needed
#
# Error Scenarios:
# 1. Invalid input (non-text, non-French)
#    → Send error message: "Veuillez envoyer un texte en français."
#
# 2. TTS Chatterbox server unavailable
#    → Retry 3 times with 5-second delays
#    → If still fails: "Service temporairement indisponible. Réessayez plus tard."
#
# 3. Lip sync processing timeout
#    → Set timeout: 5 minutes
#    → If timeout: "Le traitement prend plus de temps que prévu. Réessayez avec un texte plus court."
#
# 4. Video composition failure
#    → Log error details
#    → Send generic error: "Erreur lors de la génération de la vidéo."
#
# 5. Telegram upload failure
#    → Retry upload 3 times
#    → If file too large: Compress and retry
#    → If still fails: Provide download link (alternative delivery method)
#
# Logging:
# - Log all requests with timestamp, user_id, text_length
# - Log processing times for each step
# - Log errors with full stack traces
# - Store logs in files or database for analysis

Reliability: Per-step error handling lets the system fail gracefully and return a meaningful message to the user instead of stalling silently. Limiting concurrency through a queue prevents overload and keeps processing fair across requests.
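The retry behaviour described in this step can be sketched as a small helper. It is a generic illustration rather than an N8N feature: the defaults (3 attempts, 5-second base delay, doubling each time) follow the figures above, and the `sleep` parameter is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], attempts: int = 3,
                       base_delay: float = 5.0, factor: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or `attempts` is exhausted.

    Waits base_delay, then base_delay * factor, and so on between tries.
    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= factor
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Wrapping the TTS call or the Telegram upload in this helper keeps the retry policy in one place; a constant delay is obtained with factor=1.0.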

Step 8: File Management & Cleanup

Create temporary file storage structure
Implement unique file naming (UUID or timestamp-based)
Set up automatic cleanup for temporary files
Configure retention policy for generated videos
Implement disk space monitoring

# File Management Structure
#
# Directory Structure:
# FTTNW/
#     temp/
#       audio/          # TTS-generated audio files
#       face_videos/    # Lip-synced face videos
#       composite/      # Final composite videos
#     processed/        # Successfully processed videos (optional retention)
#     failed/           # Failed processing attempts (for debugging)
#
# File Naming Convention:
# - Audio: {timestamp}_{user_id}_{hash}.mp3
# - Face video: {timestamp}_{user_id}_face.mp4
# - Final video: {timestamp}_{user_id}_final.mp4
#
# Cleanup Policy:
# - Delete temp files immediately after successful delivery
# - Keep processed videos for 24 hours (optional)
# - Keep failed processing files for 7 days (for debugging)
# - Monitor disk space and alert if > 80% full
#
# Cleanup Script (run periodically):
# - Delete files older than retention period
# - Calculate total disk usage
# - Send alerts if disk space critical

Storage Management: Proper file management prevents disk space issues and ensures temporary files don't accumulate. Consider implementing automated cleanup scripts that run periodically.
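A periodic cleanup script could look roughly like this. The sketch follows the naming convention and retention ideas above; the function names, the 8-character hash, and the timestamp format are assumptions made for the example.

```python
import hashlib
import shutil
import time
from pathlib import Path
from typing import Optional


def audio_filename(user_id: int, text: str, now: Optional[float] = None) -> str:
    """Build '{timestamp}_{user_id}_{hash}.mp3' (UTC timestamp, short text hash)."""
    moment = time.time() if now is None else now
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(moment))
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return f"{stamp}_{user_id}_{digest}.mp3"


def delete_older_than(directory: str, max_age_s: float,
                      now: Optional[float] = None) -> int:
    """Delete regular files whose modification time exceeds the retention period.

    Returns the number of files removed; subdirectories are left untouched.
    """
    cutoff = (time.time() if now is None else now) - max_age_s
    removed = 0
    for path in list(Path(directory).iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


def disk_usage_ratio(path: str) -> float:
    """Fraction of the disk holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

For example, delete_older_than("FTTNW/processed", 24 * 3600) applies the 24-hour retention, and an alert can be raised when disk_usage_ratio exceeds 0.8.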

Technical Requirements Summary

Telegram Bot: Bot token, webhook configuration, channel/group setup
N8N: Workflow configuration, Telegram nodes, HTTP request nodes, execute command nodes
TTS Chatterbox: Server running, voice profile configured, API endpoint accessible
Wav2Lip: Conda environment (lip_sync), models downloaded, presenter face image/video
FFmpeg: Installed and accessible, video composition scripts
Storage: Temporary file directories, disk space monitoring
Monitoring: Logging system, error tracking, performance metrics

Implementation Checklist

☐ Set up Telegram bot and obtain bot token
☐ Configure Telegram webhook in N8N
☐ Create N8N workflow with all required nodes
☐ Configure TTS Chatterbox voice profile using reference audio
☐ Test TTS Chatterbox API integration
☐ Prepare presenter face image/video for lip sync
☐ Test Wav2Lip processing pipeline
☐ Test FFmpeg video composition
☐ Implement error handling and retry logic
☐ Set up file management and cleanup system
☐ Test end-to-end workflow with sample French text
☐ Implement queue management for concurrent requests
☐ Add logging and monitoring
☐ Optimize video file sizes for Telegram delivery
☐ Create user documentation and usage instructions

Future Enhancements

Support for multiple languages (not just French)
Multiple presenter options (user selects presenter)
Custom background selection
Video quality options (HD, SD)
Batch processing for multiple texts
Video editing features (subtitles, effects)
User authentication and usage limits
Analytics dashboard for usage statistics
Webhook integration for external systems
Real-time processing status updates