Automated system that receives French text via Telegram, generates speech from it, lip-syncs a presenter's face to the audio, and returns the result as a video.
See telegram-bot-config.md for the full bot configuration. It includes ready-to-use French messages, command descriptions, and all text content needed for BotFather setup.
# Telegram Bot Configuration
# 1. Create bot via @BotFather on Telegram
# 2. Bot token: <BOT_TOKEN> (issued by @BotFather; keep it out of documentation and version control)
# 3. Configure webhook URL in N8N Telegram node
# 4. Set up message filter for text messages only
#
# N8N Telegram Trigger Node Configuration:
# - Credentials: Telegram Bot Token (<BOT_TOKEN>)
# - Update Type: "message"
# - Additional Fields: "text" (to extract message content)
# - Chat ID: Store for response delivery
Telegram Setup: The Telegram bot acts as the entry point for user requests. Messages sent to the configured channel trigger the N8N workflow. The bot token must be stored securely in the N8N credentials vault and never written into documentation or scripts. See telegram-bot-config.md for complete bot configuration including name, description, commands, and all French messages.
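For helper scripts that run outside N8N, one way to keep the token out of source files is to read it from the environment. The following is a minimal sketch, not part of the original setup: the variable name TELEGRAM_BOT_TOKEN and both function names are hypothetical, and the shape check is deliberately loose (numeric bot id, a colon, then a secret part).

```python
import os
import re

# Hypothetical variable name; N8N itself keeps the token in its credentials vault.
TOKEN_ENV_VAR = "TELEGRAM_BOT_TOKEN"

# Loose shape check only: numeric bot id, a colon, then the secret part.
_TOKEN_SHAPE = re.compile(r"^\d+:[A-Za-z0-9_-]+$")


def load_bot_token(env=None):
    """Return the bot token from the environment, or raise if missing or malformed."""
    env = os.environ if env is None else env
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"{TOKEN_ENV_VAR} is not set")
    if not _TOKEN_SHAPE.match(token):
        raise RuntimeError(f"{TOKEN_ENV_VAR} does not look like a bot token")
    return token


def mask_token(token):
    """Return a masked form of the token that is safe to write to logs."""
    bot_id, _, _secret = token.partition(":")
    return f"{bot_id}:***"
```

Logging only the masked form keeps the secret part out of the request logs described later in this document.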
# N8N Workflow Structure:
#
# 1. Telegram Trigger Node
# - Receives message from Telegram channel
# - Extracts: message.text, message.chat.id, message.from.id
#
# 2. Function Node: Validate Input
# - Check if message.text exists
# - Validate French language (optional: use language detection API)
# - Return error message if validation fails
#
# 3. HTTP Request Node: TTS Chatterbox
# - Method: POST
# - URL: http://localhost:PORT/api/tts (TTS Chatterbox endpoint)
# - Body: { "text": "{{ $json.message.text }}", "voice": "girlstorytelling" }
# - Response: Audio file (MP3/WAV)
#
# 4. Execute Command Node: Lip Sync Processing
# - Command: PowerShell script to run Wav2Lip
# - Input: Audio file from TTS, Presenter face image/video
# - Output: Lip-synced face video
#
# 5. Execute Command Node: Video Composition
# - Command: FFmpeg to overlay face on background
# - Input: Background image, Lip-synced face video
# - Output: Final composite video
#
# 6. Telegram Send Video Node
# - Chat ID: {{ $json.message.chat.id }}
# - Video: Final composite video file
# - Caption: "Voici votre vidéo générée!"
Workflow Design: The N8N workflow orchestrates the entire process from message reception to video delivery. Each step should include error handling and logging for debugging purposes.
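The validation step (node 2 above) can be sketched as follows. This is an illustration in Python rather than the code of the actual N8N Function node; the function name validate_message and the optional is_french hook are hypothetical. The field names follow the Telegram update structure listed in node 1, and the reply text is the French error message used for invalid input elsewhere in this document.

```python
ERROR_REPLY = "Veuillez envoyer un texte en français."


def validate_message(update, is_french=None):
    """Extract text and ids from a Telegram update and check that text is present.

    `is_french` is an optional callable (hypothetical hook) for language
    detection; when it is None the language check is skipped.
    """
    message = update.get("message") or {}
    chat_id = (message.get("chat") or {}).get("id")
    user_id = (message.get("from") or {}).get("id")
    text = message.get("text")

    # Non-text messages (photos, stickers, ...) carry no "text" field.
    if not isinstance(text, str) or not text.strip():
        return {"ok": False, "chat_id": chat_id, "reply": ERROR_REPLY}

    text = text.strip()
    if is_french is not None and not is_french(text):
        return {"ok": False, "chat_id": chat_id, "reply": ERROR_REPLY}

    return {"ok": True, "chat_id": chat_id, "user_id": user_id, "text": text}
```

Returning the chat id even on failure lets the workflow send the error reply to the right conversation.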
# TTS Chatterbox API Integration
#
# Endpoint: http://localhost:PORT/api/tts
# Method: POST
# Content-Type: application/json
#
# Request Body:
# {
# "text": "Bonjour, voici votre message en français.",
# "voice": "girlstorytelling",
# "language": "fr",
# "output_format": "mp3",
# "sample_rate": 16000
# }
#
# Response:
# - Audio file (binary) or file path
# - Duration information
# - File metadata
#
# Voice Profile Setup:
# - Reference audio: scripts\utils\test_lips\girlstorytelling.mp3
# - Voice name: "girlstorytelling" or custom name
# - Language: French (fr)
# - Voice characteristics: Female, storytelling style
TTS Configuration: The TTS Chatterbox server is already installed on the machine. The voice profile should be configured to match the characteristics of the reference audio file (girlstorytelling.mp3). Ensure French language support is enabled.
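The request described above could be issued from a script roughly as follows. This is a sketch under assumptions: the base URL (including the port, which this document leaves as PORT) is passed in by the caller, the response body is treated as raw audio bytes (the document also allows a file path response), and the function names are hypothetical.

```python
import json
import urllib.request


def build_tts_request(text, base_url, voice="girlstorytelling"):
    """Build the POST request for the TTS endpoint using the body shown above."""
    payload = {
        "text": text,
        "voice": voice,
        "language": "fr",
        "output_format": "mp3",
        "sample_rate": 16000,
    }
    # ensure_ascii=False keeps accented French characters readable in the body.
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/tts",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, base_url, out_path, timeout=60):
    """Send the request and save the response body, assumed to be binary audio."""
    request = build_tts_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

Keeping request construction separate from the network call makes the payload easy to inspect without a running TTS server.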
# Lip Sync Processing Script
#
# Input Files:
# - Audio: TTS-generated audio (from Step 3)
# - Face: Presenter face image/video (e.g., docs_website\assets\images\TV_presenter_FaceZoom.jpg)
#
# Processing Command:
powershell -ExecutionPolicy Bypass -File scripts\utils\run_lip_sync.ps1 python third_party\Wav2Lip\inference.py `
--checkpoint_path models\wav2lip\wav2lip.pth `
--face "docs_website\assets\images\TV_presenter_FaceZoom.jpg" `
--audio "{{ $json.tts_audio_path }}" `
--outfile "{{ $json.temp_output_path }}\lip_synced_face.mp4" `
--pads 0 20 0 0 `
--face_det_batch_size 16 `
--wav2lip_batch_size 128
#
# Output:
# - Lip-synced face video (274x276 resolution)
# - Duration matches audio length
# - Synchronized lip movements
#
# Note: To feed Wav2Lip a video instead of a static image, convert the face image to video first:
# ffmpeg -loop 1 -i "face.jpg" -t AUDIO_DURATION -vf "fps=25" -pix_fmt yuv420p "face_video.mp4"
Lip Sync Requirements: Wav2Lip processing requires the conda lip_sync environment. Processing time scales with video duration (roughly 30 seconds of processing for a 20-second video). Consider implementing asynchronous processing or a queue system to handle multiple concurrent requests.
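When the lip sync step is launched from a script rather than typed by hand, passing the arguments as a list avoids quoting problems with paths. The sketch below mirrors the processing command above; the function names are hypothetical, and the 300-second timeout is taken from the 5-minute lip sync timeout in the error handling strategy.

```python
import subprocess
from pathlib import PureWindowsPath

# Paths and option values copied from the processing command above.
DEFAULT_FACE = r"docs_website\assets\images\TV_presenter_FaceZoom.jpg"
DEFAULT_CHECKPOINT = r"models\wav2lip\wav2lip.pth"


def build_wav2lip_command(audio_path, output_dir,
                          face_path=DEFAULT_FACE, checkpoint=DEFAULT_CHECKPOINT):
    """Return the argument list for the PowerShell wrapper that runs Wav2Lip."""
    outfile = str(PureWindowsPath(output_dir) / "lip_synced_face.mp4")
    return [
        "powershell", "-ExecutionPolicy", "Bypass",
        "-File", r"scripts\utils\run_lip_sync.ps1",
        "python", r"third_party\Wav2Lip\inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face_path,
        "--audio", audio_path,
        "--outfile", outfile,
        "--pads", "0", "20", "0", "0",
        "--face_det_batch_size", "16",
        "--wav2lip_batch_size", "128",
    ]


def run_lip_sync(audio_path, output_dir, timeout_s=300):
    """Run the command; raises on a non-zero exit code or when the timeout expires."""
    command = build_wav2lip_command(audio_path, output_dir)
    subprocess.run(command, check=True, timeout=timeout_s)
    return command[command.index("--outfile") + 1]
```

subprocess.run raises subprocess.TimeoutExpired when the limit is exceeded, which the caller can map to the timeout message sent to the user.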
# Video Composition with FFmpeg
#
# Step 1: Get audio/face video duration
$faceVideoDuration = ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "lip_synced_face.mp4"
#
# Step 2: Convert background image to video
ffmpeg -loop 1 -i "docs_website\assets\images\TV_background03_HD_with_logo.jpg" `
-t $faceVideoDuration `
-vf "fps=25" `
-pix_fmt yuv420p `
"temp_background.mp4"
#
# Step 3: Overlay face video on background
ffmpeg -i "temp_background.mp4" `
-i "lip_synced_face.mp4" `
-filter_complex "[0:v][1:v]overlay=x=423:y=170" `
-c:v libx264 `
-pix_fmt yuv420p `
-c:a copy `
"final_composite_video.mp4"
#
# Step 4: Optimize for Telegram (optional - reduce file size)
ffmpeg -i "final_composite_video.mp4" `
-c:v libx264 `
-preset medium `
-crf 23 `
-maxrate 2M `
-bufsize 4M `
-c:a aac `
-b:a 128k `
"final_video_optimized.mp4"
#
# Output Specifications:
# - Resolution: 1920x1080 (1080p)
# - Frame rate: 25 fps
# - Codec: H.264/AVC
# - Container: MP4
# - Audio: AAC, 48kHz (or match TTS audio)
Video Composition: The final video combines the TV studio background with the lip-synced presenter face. Positioning coordinates (x=423, y=170) should match the presenter's location in the background image. Consider implementing automatic face detection for dynamic positioning.
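Before running the overlay, it can help to check that the face video actually fits inside the background at the chosen coordinates. The sketch below uses the sizes stated in this document (274x276 face video, 1920x1080 output) and builds the same overlay filter string; the function names are hypothetical and automatic face detection is not covered.

```python
FACE_SIZE = (274, 276)     # lip-synced face video size stated above
FRAME_SIZE = (1920, 1080)  # output resolution stated above


def fits_in_frame(x, y, face_size=FACE_SIZE, frame_size=FRAME_SIZE):
    """True when the face rectangle placed at (x, y) lies fully inside the frame."""
    face_w, face_h = face_size
    frame_w, frame_h = frame_size
    return x >= 0 and y >= 0 and x + face_w <= frame_w and y + face_h <= frame_h


def overlay_filter(x, y):
    """FFmpeg filter_complex string that overlays input 1 on input 0 at (x, y)."""
    return f"[0:v][1:v]overlay=x={x}:y={y}"


def build_overlay_command(background_video, face_video, output, x=423, y=170):
    """Argument list for the overlay step; refuses positions outside the frame."""
    if not fits_in_frame(x, y):
        raise ValueError(f"face at ({x}, {y}) does not fit in the frame")
    return [
        "ffmpeg", "-i", background_video, "-i", face_video,
        "-filter_complex", overlay_filter(x, y),
        "-c:v", "libx264", "-pix_fmt", "yuv420p", "-c:a", "copy",
        output,
    ]
```

With the documented position, the face occupies x from 423 to 697 and y from 170 to 446, well inside the 1080p frame.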
# Telegram Video Delivery
#
# N8N Telegram Send Video Node Configuration:
# - Chat ID: {{ $json.message.chat.id }}
# - Video File: {{ $json.final_video_path }}
# - Caption: "Voici votre vidéo générée! 🎬"
# - Format: MP4 (Telegram clients play MPEG-4 video; other formats may be delivered as a document)
# - Maximum file size: 50MB
#
# Error Handling:
# - If file > 50MB: Compress video further or split into parts
# - If upload fails: Retry with exponential backoff
# - Send error message to user if delivery fails after retries
#
# Cleanup:
# - Delete temporary audio files
# - Delete temporary video files
# - Keep final video for X hours (optional: for user re-download)
# - Log all operations for debugging
Delivery Process: The final video is sent back to the user via Telegram. Ensure proper error handling for network issues, file size limitations, and Telegram API rate limits. Consider implementing a status update system to keep users informed during processing.
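The retry-with-exponential-backoff behaviour mentioned above can be sketched independently of the Telegram node. In this illustration the send callable, the function names, and the 1-second base delay are assumptions; the 50MB limit and the count of 3 retries come from this document.

```python
import time

# Conservative reading of the 50MB limit (decimal megabytes).
MAX_VIDEO_BYTES = 50 * 1000 * 1000


def needs_compression(size_bytes):
    """True when the file exceeds the upload limit and must be compressed first."""
    return size_bytes > MAX_VIDEO_BYTES


def backoff_delays(retries, base_delay=1.0, factor=2.0):
    """Delays between attempts: base, base*factor, base*factor^2, ..."""
    return [base_delay * factor ** attempt for attempt in range(retries)]


def send_with_retry(send, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send(); on failure wait and retry up to `retries` more times.

    Re-raises the last error once all attempts are used up, so the caller
    can notify the user that delivery failed.
    """
    delays = backoff_delays(retries, base_delay)
    last_error = None
    for attempt in range(retries + 1):
        try:
            return send()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < retries:
                sleep(delays[attempt])
    raise last_error
```

Passing the sleep function as a parameter keeps the helper easy to exercise without real waiting.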
# Error Handling Strategy
#
# Queue Management:
# - Use N8N queue system or external queue (Redis, RabbitMQ)
# - Limit concurrent processing (e.g., max 2-3 simultaneous requests)
# - Implement priority queue if needed
#
# Error Scenarios:
# 1. Invalid input (non-text, non-French)
#    → Send error message: "Veuillez envoyer un texte en français."
#
# 2. TTS Chatterbox server unavailable
#    → Retry 3 times with 5-second delays
#    → If still fails: "Service temporairement indisponible. Réessayez plus tard."
#
# 3. Lip sync processing timeout
#    → Set timeout: 5 minutes
#    → If timeout: "Le traitement prend plus de temps que prévu. Réessayez avec un texte plus court."
#
# 4. Video composition failure
#    → Log error details
#    → Send generic error: "Erreur lors de la génération de la vidéo."
#
# 5. Telegram upload failure
#    → Retry upload 3 times
#    → If file too large: Compress and retry
#    → If still fails: Provide download link (alternative delivery method)
#
# Logging:
# - Log all requests with timestamp, user_id, text_length
# - Log processing times for each step
# - Log errors with full stack traces
# - Store logs in files or database for analysis
Reliability: Robust error handling ensures the system can gracefully handle failures and provide meaningful feedback to users. Queue management prevents system overload and ensures fair processing of requests.
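The concurrency limit and the mapping from failures to user-facing messages could look roughly like this when the pipeline is driven from a script. The exception classes and function names are hypothetical; the French messages are the ones defined in the error scenarios, and a thread pool stands in for the N8N or external queue.

```python
from concurrent.futures import ThreadPoolExecutor


class TTSUnavailable(Exception):
    """Raised when the TTS server cannot be reached after retries."""


class LipSyncTimeout(Exception):
    """Raised when lip sync processing exceeds its time limit."""


USER_MESSAGES = {
    TTSUnavailable: "Service temporairement indisponible. Réessayez plus tard.",
    LipSyncTimeout: "Le traitement prend plus de temps que prévu. "
                    "Réessayez avec un texte plus court.",
}
GENERIC_MESSAGE = "Erreur lors de la génération de la vidéo."


def user_message_for(error):
    """Pick the French message for a failure; unknown errors get the generic one."""
    return USER_MESSAGES.get(type(error), GENERIC_MESSAGE)


def process_all(requests, handler, max_concurrent=2):
    """Run handler(request) for each request, at most `max_concurrent` at a time.

    Returns one ("ok", result) or ("error", message) pair per request,
    in the same order as the input.
    """
    outcomes = []
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(handler, request) for request in requests]
        for future in futures:
            try:
                outcomes.append(("ok", future.result()))
            except Exception as error:
                outcomes.append(("error", user_message_for(error)))
    return outcomes
```

Requests beyond the limit wait inside the pool's internal queue, so a burst of messages does not start more than max_concurrent lip sync jobs at once.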
# File Management Structure
#
# Directory Structure:
# FTTNW/
# temp/
# audio/ # TTS-generated audio files
# face_videos/ # Lip-synced face videos
# composite/ # Final composite videos
# processed/ # Successfully processed videos (optional retention)
# failed/ # Failed processing attempts (for debugging)
#
# File Naming Convention:
# - Audio: {timestamp}_{user_id}_{hash}.mp3
# - Face video: {timestamp}_{user_id}_face.mp4
# - Final video: {timestamp}_{user_id}_final.mp4
#
# Cleanup Policy:
# - Delete temp files immediately after successful delivery
# - Keep processed videos for 24 hours (optional)
# - Keep failed processing files for 7 days (for debugging)
# - Monitor disk space and alert if > 80% full
#
# Cleanup Script (run periodically):
# - Delete files older than retention period
# - Calculate total disk usage
# - Send alerts if disk space critical
Storage Management: Proper file management prevents disk space issues and ensures temporary files don't accumulate. Consider implementing automated cleanup scripts that run periodically.
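A periodic cleanup pass along the lines of the policy above might look like this. It is a sketch only: the function names are hypothetical, the retention periods (24 hours for processed videos, 7 days for failed attempts) and the 80% threshold come from this document, and alert delivery is left to the caller.

```python
import shutil
import time
from pathlib import Path

# Retention periods from the cleanup policy, in seconds.
RETENTION_SECONDS = {
    "processed": 24 * 3600,
    "failed": 7 * 24 * 3600,
}
DISK_ALERT_THRESHOLD = 0.80


def delete_older_than(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` older than the limit; return their names."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


def disk_usage_ratio(path):
    """Fraction of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_is_critical(path, threshold=DISK_ALERT_THRESHOLD):
    """True when disk usage is above the alert threshold."""
    return disk_usage_ratio(path) > threshold
```

A scheduled job could call delete_older_than once per retained directory with the matching entry from RETENTION_SECONDS, then raise an alert when disk_is_critical returns True.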
lip_sync), models downloaded, presenter face image/video