File format recognition¶
Overview¶
tg-note supports automatic recognition and processing of various file formats using Docling. When you send a file to the bot, the system automatically extracts text content and integrates it into your knowledge base.
Supported formats¶
Documents¶
- PDF (.pdf)
- Word (.docx)
- PowerPoint (.pptx)
- Excel (.xlsx)
Text files¶
- Markdown (.md)
- HTML (.html)
- Plain Text (.txt)
Images¶
- JPEG (.jpg, .jpeg)
- PNG (.png)
- TIFF (.tiff)
How it works¶
Automatic processing¶
- Send a file to the bot (attachment or forwarded)
- The bot downloads the file to a temporary directory
- Docling processes the file and extracts text
- The content is merged with the message text for analysis
- The result is saved to the knowledge base
Example¶
# Just send a file to the bot
# The bot will:
# 1. Detect file format
# 2. Extract content
# 3. Analyze with the AI agent
# 4. Save to the KB with proper structure
Architecture¶
Components¶
- FileProcessor (src/processor/file_processor.py)
  - Manages file processing
  - Docling integration
  - Telegram file download
  - Temporary storage
- ContentParser (src/processor/content_parser.py)
  - Adds parse_group_with_files()
  - Merges file content with message text
  - Async processing
- BotHandlers (src/bot/handlers.py)
  - Uses the new method to process files
  - Supports documents and photos
  - Automatic cleanup of temporary files
Processing flow¶
      ┌─────────────────┐
      │ Telegram Message│
      │    with File    │
      └────────┬────────┘
               │
               ▼
      ┌─────────────────┐
      │  Bot Handlers   │
      │ (download file) │
      └────────┬────────┘
               │
               ▼
      ┌─────────────────┐
      │ File Processor  │
      │ (encode base64) │
      └────────┬────────┘
               │ HTTP/SSE (MCP Protocol)
               ▼
┌─────────────────────────────────┐
│      Docling MCP Container      │
│      (docling-mcp==1.3.2)       │
│                                 │
│  ┌──────────────────────────┐   │
│  │ convert_document_from_   │   │
│  │ content (base64)         │   │
│  └───────────┬──────────────┘   │
│              │                  │
│  ┌───────────▼──────────────┐   │
│  │ DocumentConverter        │   │
│  │ (original docling-mcp)   │   │
│  └───────────┬──────────────┘   │
└──────────────┼──────────────────┘
               │
               │ Extracted text + metadata
               ▼
      ┌─────────────────┐
      │ Content Parser  │
      │ (merge content) │
      └────────┬────────┘
               │
               ▼
      ┌─────────────────┐
      │    AI Agent     │
      │   (analysis)    │
      └────────┬────────┘
               │
               ▼
      ┌─────────────────┐
      │ Knowledge Base  │
      │     (save)      │
      └─────────────────┘
Key points:
- Files are transferred via base64 encoding (no shared filesystem needed)
- Uses original docling-mcp package with minimal wrappers
- MCP protocol over HTTP/SSE for communication
- Automatic tool detection and configuration
Installation¶
Docling runs as an MCP server using the official docling-mcp package. Docker Compose includes the docling-mcp service by default, so starting the stack is enough to enable document processing.
Architecture¶
The Docling integration uses the original docling-mcp==1.3.2 package with minimal wrappers for tg-note integration:
- Original docling-mcp: All core functionality comes from the official package
- Base64 file transfer: Files are sent via the convert_document_from_content tool (no shared filesystem needed)
- Automatic registration: MCP Hub automatically registers the Docling server on startup
- Qwen CLI integration: Docling is automatically configured in ~/.qwen/settings.json
The Docling container is built from the repository (docker/docling-mcp/Dockerfile) with GPU support.
Model artefacts and configuration are persisted under:
- config.yaml – shared configuration file used by both the bot and the Docling container
- data/docling/models – downloaded OCR/VLM models
- data/docling/cache – HuggingFace / ModelScope caches
- logs/docling – container logs
File Transfer in Docker Mode¶
In Docker mode, files are transferred via base64 encoding through the convert_document_from_content MCP tool. This approach:
- ✅ Works without shared filesystem between containers
- ✅ Supports all file formats (PDF, DOCX, images, etc.)
- ✅ Uses the original docling-mcp conversion pipeline
- ✅ Maintains caching and performance optimizations
The bot automatically:
1. Downloads files from Telegram
2. Encodes them as base64
3. Sends to Docling MCP server via convert_document_from_content
4. Receives extracted text and metadata
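The encoding step above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names build_tool_arguments and decode_tool_arguments and the argument keys "content" and "filename" are hypothetical, so check the actual input schema of convert_document_from_content before relying on them.

```python
import base64
import json


def build_tool_arguments(data: bytes, filename: str) -> dict:
    """Build a JSON-serializable argument dict for a base64-based tool call.

    The keys "content" and "filename" are hypothetical placeholders.
    """
    encoded = base64.b64encode(data).decode("ascii")
    return {"content": encoded, "filename": filename}


def decode_tool_arguments(arguments: dict) -> bytes:
    """Reverse the encoding, as the receiving side would."""
    return base64.b64decode(arguments["content"])


raw = b"%PDF-1.7 example bytes"
arguments = build_tool_arguments(raw, "report.pdf")

# Raw bytes cannot be placed in JSON directly; the base64 string can.
wire = json.dumps(arguments)
restored = decode_tool_arguments(json.loads(wire))
print(restored == raw)
```

Base64 output is roughly one third larger than the input, which is worth keeping in mind when choosing a per-file size limit.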
Configuration¶
Changing Docling settings via /settings updates the shared configuration and automatically triggers
model downloads. Progress updates are sent back to the Telegram chat, so no separate command is
required.
Docling settings expose detailed controls under MEDIA_PROCESSING_DOCLING:
- startup_sync: enable/disable automatic downloads on container start
- keep_images / generate_page_images: embed page snapshots in the output
- ocr_config: choose between rapidocr, easyocr, tesseract, tesseract_cli, or onnxtr
- model_cache.builtin_models: Docling-managed bundles to pre-download (layout, RapidOCR, EasyOCR, VLMs, etc.)
- model_cache.downloads: optional additional artefacts fetched from HuggingFace or ModelScope
- pipeline: enables or disables Docling pipeline stages (table structure, code/formula enrichment, picture classification, VLM descriptions)
The default configuration ships with RapidOCR (GPU-enabled via ONNX Runtime). Switch to EasyOCR
or Tesseract by updating ocr_config.backend and adjusting backend-specific sections.
Local fallback (optional)¶
If you prefer to run Docling inside the bot process, install the Python package manually and set
backend to local under MEDIA_PROCESSING_DOCLING in config.yaml.
Verify installation¶
from src.processor.file_processor import FileProcessor

processor = FileProcessor()
if processor.is_available():
    print("Docling available!")
    print(f"Supported formats: {processor.get_supported_formats()}")
else:
    print("Docling not available")
Examples¶
PDF document¶
1. Send a PDF file
2. Bot replies: "🔄 Processing message..."
3. After processing you get details:
✅ Saved successfully!
📁 File: research-paper-2024-10-04.md
📂 Category: science/research
🏷 Tags: pdf, research, ai
Image with text¶
1. Send an image (screenshot or document photo)
2. Docling extracts text from the image
3. The text is analyzed and saved to the KB
Multiple files¶
1. Send multiple files in a row
2. The bot groups them using a 30-second timeout
3. All files are processed and merged into one note
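The grouping step can be sketched as below. This is only an illustration under an assumption the text does not confirm: the 30-second window is measured from the previous message, so any longer gap starts a new group. The function name group_by_timeout is hypothetical.

```python
def group_by_timeout(messages, timeout_seconds=30.0):
    """Group (timestamp, payload) pairs into lists of payloads.

    Assumption: a gap longer than timeout_seconds since the previous
    message starts a new group.
    """
    groups = []
    last_seen = None
    for timestamp, payload in sorted(messages, key=lambda m: m[0]):
        if last_seen is None or timestamp - last_seen > timeout_seconds:
            groups.append([])  # gap too long (or first message): new group
        groups[-1].append(payload)
        last_seen = timestamp
    return groups


incoming = [(0, "a.pdf"), (10, "b.png"), (45, "c.docx")]
print(group_by_timeout(incoming))
```

Here a.pdf and b.png arrive 10 seconds apart and share a group, while c.docx arrives 35 seconds after b.png and starts a new one.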
Error handling¶
Unsupported format¶
- The bot still tries to extract text
- Processes the rest of the message content
- Does not abort processing
File processing error¶
- Error is logged
- User can be notified (optional)
- Other content is still processed
Temporary files¶
- Temporary directories are created
- Files are cleaned up after processing
- Exceptions on cleanup are handled
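The behaviour described in this section (log the error, keep going, always clean up) could look roughly like the sketch below. The function process_in_temp_dir and its extract callback are hypothetical stand-ins for the real processor; only standard-library calls are used.

```python
import logging
import shutil
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def process_in_temp_dir(filename, data, extract):
    """Write data to a temporary directory, run extract(path), then clean up.

    Returns whatever extract returns, or None if processing failed.
    """
    temp_dir = Path(tempfile.mkdtemp(prefix="tg_note_"))
    try:
        # Keep only the final path component so the file stays inside temp_dir.
        path = temp_dir / Path(filename).name
        path.write_bytes(data)
        return extract(path)
    except Exception:
        # Log and continue: the caller can still process the message text.
        logger.exception("File processing failed for %s", filename)
        return None
    finally:
        # ignore_errors=True keeps a cleanup failure from raising.
        shutil.rmtree(temp_dir, ignore_errors=True)


print(process_in_temp_dir("note.txt", b"hello", lambda p: p.read_text()))
```

Returning None instead of raising lets the caller continue with the rest of the message when one file fails.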
Settings¶
File format recognition works out of the box. You can customize whether it's enabled and which formats are processed:
Enabling/Disabling media processing¶
You can completely enable or disable media file processing using the master switch:
# config.yaml
# Enable media processing (default)
MEDIA_PROCESSING_ENABLED: true
# Disable all media processing
MEDIA_PROCESSING_ENABLED: false
When MEDIA_PROCESSING_ENABLED is set to false, all file processing is disabled regardless of other settings.
Docling configuration¶
Docling behaviour is configured via the MEDIA_PROCESSING_DOCLING block in config.yaml. Key options:
- enabled: Docling master switch (in addition to MEDIA_PROCESSING_ENABLED)
- backend: "mcp" (default) or "local" to force the in-process DocumentConverter
- formats: list of allowed file extensions (lowercase, no dots)
- max_file_size_mb: per-file size limit (set to 0 to disable the limit)
- prefer_markdown_output: prefer Markdown export when possible
- fallback_plain_text: automatically fall back to text export when the preferred export fails
- image_ocr_enabled: allow OCR for image formats (jpg, jpeg, png, tiff)
- ocr_languages: OCR language hints (ISO 639-3 codes such as eng, deu)
- mcp: nested MCP configuration (server_name, url, tool_name, auto_detect_tool, etc.)
# config.yaml
MEDIA_PROCESSING_DOCLING:
  enabled: true
  backend: mcp
  max_file_size_mb: 25
  prefer_markdown_output: true
  fallback_plain_text: true
  image_ocr_enabled: true
  ocr_languages:
    - eng
  formats:
    - pdf
    - docx
    - pptx
    - xlsx
    - html
    - md
    - txt
    - jpg
    - jpeg
    - png
    - tiff
  mcp:
    server_name: docling
    transport: sse
    url: http://docling-mcp:8077/sse # Override for custom deployments
    tool_name: convert_document_from_content
    auto_detect_tool: true
💡 Base64 Transfer: The default tool convert_document_from_content accepts base64-encoded payloads, allowing tg-note to send documents directly to the Docling MCP container without relying on shared filesystem paths. This is the recommended approach for Docker deployments. The classic convert_document tool remains available for path-based workflows but requires shared volumes.

Original Package: The integration uses the official docling-mcp==1.3.2 package with minimal wrappers:
- Original MCP server entrypoint (docling_mcp.servers.mcp_server.main)
- Original conversion tools and caching
- Additional convert_document_from_content tool for base64 transfer
- Custom converter configuration for OCR settings
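To make the formats and max_file_size_mb options concrete, here is a small sketch of a pre-check a client might run before sending a file to Docling. The function is_allowed is hypothetical, and treating one megabyte as 1024 * 1024 bytes is an assumption; the project may count sizes differently.

```python
from pathlib import Path


def is_allowed(path, size_bytes, formats, max_file_size_mb):
    """Return True if the file passes the format and size checks.

    formats: lowercase extensions without dots, e.g. ["pdf", "jpg"].
    max_file_size_mb: 0 disables the size limit.
    """
    extension = Path(path).suffix.lower().lstrip(".")
    if extension not in formats:
        return False
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    if max_file_size_mb and size_bytes > max_file_size_mb * 1024 * 1024:
        return False
    return True


print(is_allowed("Report.PDF", 2_000_000, ["pdf", "docx"], 25))
print(is_allowed("photo.jpg", 2_000_000, ["pdf", "docx"], 25))
```

The extension is lowercased before the lookup because the formats list is documented as lowercase without dots.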
Enabling/Disabling specific formats¶
Customise the formats list (and optionally image_ocr_enabled) to restrict Docling:
# Enable only documents (no images)
MEDIA_PROCESSING_DOCLING:
  enabled: true
  image_ocr_enabled: false
  formats:
    - pdf
    - docx
    - pptx
    - xlsx

# Enable only specific formats
MEDIA_PROCESSING_DOCLING:
  formats:
    - pdf
    - jpg
    - png

# Disable Docling entirely
MEDIA_PROCESSING_DOCLING:
  enabled: false
Legacy note: The older MEDIA_PROCESSING_DOCLING_FORMATS list is still supported for backward compatibility, but new deployments should migrate to the richer MEDIA_PROCESSING_DOCLING configuration block.
Message grouping timeout¶
Media processing configuration structure¶
The media processing configuration has a hierarchical structure:
# Master switch - controls all media processing
MEDIA_PROCESSING_ENABLED: true

# Per-framework format configuration (new structure)
MEDIA_PROCESSING_DOCLING:
  enabled: true
  formats:
    - pdf
    - jpg
    - ...

# Future frameworks can be added
# MEDIA_PROCESSING_SOME_OTHER_FRAMEWORK_FORMATS:
#   - mp3
#   - mp4
#   - ...

# (Deprecated) Older list-only syntax is still recognised:
# MEDIA_PROCESSING_DOCLING_FORMATS:
#   - pdf
#   - jpg
#   - ...
Configuration examples¶
# Example 1: Completely disable media processing
MEDIA_PROCESSING_ENABLED: false
# When disabled, format lists are ignored

# Example 2: Enable only PDF processing
MEDIA_PROCESSING_ENABLED: true
MEDIA_PROCESSING_DOCLING:
  formats:
    - pdf

# Example 3: Enable documents but not images
MEDIA_PROCESSING_ENABLED: true
MEDIA_PROCESSING_DOCLING:
  image_ocr_enabled: false
  formats:
    - pdf
    - docx
    - pptx
    - xlsx

# Example 4: Enable only images (OCR)
MEDIA_PROCESSING_ENABLED: true
MEDIA_PROCESSING_DOCLING:
  formats:
    - jpg
    - jpeg
    - png
    - tiff
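One way to read the hierarchy in these examples is as a small resolution function over the parsed configuration. The sketch below is an interpretation, not the project's loader. It assumes that omitted switches default to enabled, that the new block takes precedence over the deprecated list, that a missing formats key means no formats, and that image_ocr_enabled: false removes image formats. The name effective_formats is hypothetical.

```python
IMAGE_FORMATS = {"jpg", "jpeg", "png", "tiff"}


def effective_formats(config):
    """Return the set of extensions Docling would handle for a config dict."""
    # Master switch: when off, format lists are ignored.
    if not config.get("MEDIA_PROCESSING_ENABLED", True):
        return set()

    docling = config.get("MEDIA_PROCESSING_DOCLING")
    if docling is None:
        # Deprecated list-only syntax (assumed to apply only without the new block).
        return set(config.get("MEDIA_PROCESSING_DOCLING_FORMATS", []))

    if not docling.get("enabled", True):
        return set()

    formats = set(docling.get("formats", []))
    if not docling.get("image_ocr_enabled", True):
        formats -= IMAGE_FORMATS  # assumption: no OCR means images are skipped
    return formats


example_3 = {
    "MEDIA_PROCESSING_ENABLED": True,
    "MEDIA_PROCESSING_DOCLING": {
        "image_ocr_enabled": False,
        "formats": ["pdf", "docx", "pptx", "xlsx"],
    },
}
print(sorted(effective_formats(example_3)))
```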
Advanced usage¶
Programmatic access¶
import asyncio
from pathlib import Path

from src.processor.file_processor import FileProcessor


async def process_my_file():
    processor = FileProcessor()
    if not processor.is_available():
        print("Docling not available")
        return

    # A falsy result means nothing was extracted
    result = await processor.process_file(Path("my_document.pdf"))
    if result:
        print(f"Extracted {len(result['text'])} chars")
        print(f"Metadata: {result['metadata']}")
        print(f"Text: {result['text'][:100]}...")


# process_file is a coroutine, so it needs an event loop
asyncio.run(process_my_file())
Agent integration¶
# in content_parser.py
content = await self.content_parser.parse_group_with_files(group, bot=self.bot)
# content['text'] contains:
# - Message text
# - Extracted file content
# - File metadata
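As a rough picture of what the merged text might look like, the sketch below combines message text with per-file results. It is not the body of parse_group_with_files(): the function merge_content, the "name" key, and the separator layout are hypothetical choices made for illustration.

```python
def merge_content(message_text, file_results):
    """Combine message text with extracted file contents into one string.

    file_results: list of dicts with "name", "text", and optional "metadata"
    keys (a hypothetical shape chosen for this sketch).
    """
    parts = []
    if message_text and message_text.strip():
        parts.append(message_text.strip())

    for result in file_results:
        section = [f"--- File: {result['name']} ---"]
        metadata = result.get("metadata", {})
        if metadata:
            pairs = ", ".join(f"{k}={v}" for k, v in sorted(metadata.items()))
            section.append(f"Metadata: {pairs}")
        section.append(result["text"].strip())
        parts.append("\n".join(section))

    # Blank line between the message and each file section.
    return "\n\n".join(parts)


print(merge_content(
    "See attached",
    [{"name": "a.pdf", "text": "Hello", "metadata": {"pages": 2}}],
))
```

Labelling each section with its file name keeps it clear which text came from which attachment once everything is in a single string.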
Performance¶
Optimization¶
- Async IO for file operations
- Temporary files cleaned automatically
- Sequential but efficient handling of multiple files
Telegram limits¶
- Max file size: 20 MB (bots)
- Files hosted by Telegram temporarily
- Download speed depends on network
Debugging¶
Enable detailed logging (DEBUG level). You will see:
- File download progress
- Docling results
- Errors and warnings
- Cleanup operations
Check Docling¶
import logging
logging.basicConfig(level=logging.DEBUG)
from src.processor.file_processor import FileProcessor
processor = FileProcessor()
print(f"Docling available: {processor.is_available()}")
print(f"Supported: {processor.get_supported_formats()}")
Known issues¶
- Large files (>10 MB) may take longer
- Low-quality images reduce OCR quality
- Complex PDF layout may require extra processing
Support¶
- Docs: https://artyomzemlyak.github.io/tg-note/
- Issues: https://github.com/ArtyomZemlyak/tg-note/issues
- Discussions: https://github.com/ArtyomZemlyak/tg-note/discussions
Roadmap¶
- ✅ Basic file support
- 🚧 Audio/video files
- 📋 Better table extraction
- 📋 Archive support (.zip, .tar.gz)
- 📋 Batch processing
- 📋 Caching