File format recognition¶

Overview¶

tg-note supports automatic recognition and processing of various file formats using Docling. When you send a file to the bot, the system automatically extracts text content and integrates it into your knowledge base.

Supported formats¶

Documents¶

PDF (.pdf)
Word (.docx)
PowerPoint (.pptx)
Excel (.xlsx)

Text files¶

Markdown (.md)
HTML (.html)
Plain Text (.txt)

Images¶

JPEG (.jpg, .jpeg)
PNG (.png)
TIFF (.tiff)

How it works¶

Automatic processing¶

Send a file to the bot (attachment or forwarded)
The bot downloads the file to a temporary directory
Docling processes the file and extracts text
The content is merged with the message text for analysis
The result is saved to the knowledge base

Example¶

# Just send a file to the bot
# The bot will:
# 1. Detect file format
# 2. Extract content
# 3. Analyze with the AI agent
# 4. Save to the KB with proper structure

Architecture¶

Components¶

FileProcessor (src/processor/file_processor.py)
Manages file processing
Docling integration
Telegram file download
Temporary storage
ContentParser (src/processor/content_parser.py)
Adds parse_group_with_files()
Merges file content with message text
Async processing
BotHandlers (src/bot/handlers.py)
Uses parse_group_with_files() to process files
Supports documents and photos
Automatic cleanup of temporary files
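
The merge step attributed to ContentParser can be pictured with a small sketch. This is not the project's implementation: the `ExtractedFile` dataclass, the `merge_message_and_files` function and the section layout are hypothetical names chosen only to illustrate how extracted file text and metadata might be appended to the message text before analysis.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedFile:
    """Hypothetical container for one processed attachment."""
    name: str
    text: str
    metadata: dict = field(default_factory=dict)


def merge_message_and_files(message_text: str, files: list[ExtractedFile]) -> str:
    """Append each file's extracted text (and metadata) to the message text."""
    parts = [message_text.strip()] if message_text.strip() else []
    for f in files:
        section = [f"--- File: {f.name} ---"]
        # Sort metadata keys so the output is stable between runs.
        meta = ", ".join(f"{k}={v}" for k, v in sorted(f.metadata.items()))
        if meta:
            section.append(f"Metadata: {meta}")
        section.append(f.text.strip())
        parts.append("\n".join(section))
    return "\n\n".join(parts)


if __name__ == "__main__":
    merged = merge_message_and_files(
        "Notes from the meeting",
        [ExtractedFile("agenda.pdf", "1. Budget\n2. Hiring", {"pages": 1})],
    )
    print(merged)
```

Keeping the message text first and giving each file its own labelled section lets the downstream agent tell the user's words apart from extracted content.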

Processing flow¶

┌─────────────────┐
│ Telegram Message│
│   with File     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Bot Handlers   │
│ (download file) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ File Processor  │
│ (encode base64) │
└────────┬────────┘
         │ HTTP/SSE (MCP Protocol)
         ▼
┌─────────────────────────────────┐
│      Docling MCP Container      │
│  (docling-mcp==1.3.2)          │
│                                 │
│  ┌──────────────────────────┐  │
│  │ convert_document_from_   │  │
│  │   content (base64)        │  │
│  └───────────┬──────────────┘  │
│              │                  │
│  ┌───────────▼──────────────┐  │
│  │ DocumentConverter        │  │
│  │ (original docling-mcp)   │  │
│  └───────────┬──────────────┘  │
└──────────────┼──────────────────┘
               │
               │ Extracted text + metadata
               ▼
┌─────────────────┐
│ Content Parser  │
│ (merge content) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   AI Agent      │
│  (analysis)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Knowledge Base  │
│    (save)       │
└─────────────────┘

Key points:

Files are transferred via base64 encoding (no shared filesystem needed)
Uses the original docling-mcp package with minimal wrappers
MCP protocol over HTTP/SSE for communication
Automatic tool detection and configuration

Installation¶

Docling runs as an MCP server using the official docling-mcp package. Docker Compose includes the docling-mcp service by default, so starting the stack is enough to enable document processing.

# Start Docling MCP together with the hub and bot
docker compose up -d docling-mcp mcp-hub bot

Architecture¶

The Docling integration uses the original docling-mcp==1.3.2 package with minimal wrappers for tg-note integration:

Original docling-mcp: All core functionality comes from the official package
Base64 file transfer: Files are sent via convert_document_from_content tool (no shared filesystem needed)
Automatic registration: MCP Hub automatically registers the Docling server on startup
Qwen CLI integration: Docling is automatically configured in ~/.qwen/settings.json

The Docling container is built from the repository (docker/docling-mcp/Dockerfile) with GPU support. Model artefacts and configuration are persisted under:

config.yaml – shared configuration file used by both bot and Docling container
data/docling/models – downloaded OCR/VLM models
data/docling/cache – HuggingFace / ModelScope caches
logs/docling – container logs

File Transfer in Docker Mode¶

In Docker mode, files are transferred via base64 encoding through the convert_document_from_content MCP tool. This approach:

✅ Works without shared filesystem between containers
✅ Supports all file formats (PDF, DOCX, images, etc.)
✅ Uses the original docling-mcp conversion pipeline
✅ Maintains caching and performance optimizations

The bot automatically:

1. Downloads files from Telegram
2. Encodes them as base64
3. Sends them to the Docling MCP server via convert_document_from_content
4. Receives extracted text and metadata
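
The encoding step can be sketched with the Python standard library. The argument keys `filename` and `content` are assumptions made for illustration; the real parameter names should be taken from the tool schema that the Docling MCP server advertises. The sketch only shows why base64 lets binary files travel inside a text-based message.

```python
import base64
import json


def build_tool_arguments(filename: str, raw_bytes: bytes) -> dict:
    """Build a JSON-serialisable payload for a base64-based conversion tool.

    The keys "filename" and "content" are hypothetical; check the schema
    exposed by the server for the actual names.
    """
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"filename": filename, "content": encoded}


def decode_tool_arguments(arguments: dict) -> bytes:
    """What a receiving side would do: recover the original bytes."""
    return base64.b64decode(arguments["content"])


if __name__ == "__main__":
    payload = build_tool_arguments("report.pdf", b"%PDF-1.7 example bytes")
    # The payload is plain text, so it can be embedded in a JSON message.
    wire = json.dumps(payload)
    restored = decode_tool_arguments(json.loads(wire))
    print(len(wire), restored[:8])
```

Base64 represents every 3 bytes as 4 characters, so the transferred payload is roughly one third larger than the file itself.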

Configuration¶

Changing Docling settings via /settings updates the shared configuration and automatically triggers model downloads. Progress updates are sent back to the Telegram chat, so no separate command is required.

Docling settings expose detailed controls under MEDIA_PROCESSING_DOCLING:

startup_sync: enable/disable automatic downloads on container start
keep_images / generate_page_images: embed page snapshots in the output
ocr_config: choose between rapidocr, easyocr, tesseract, tesseract_cli, or onnxtr
model_cache.builtin_models: Docling-managed bundles to pre-download (layout, RapidOCR, EasyOCR, VLMs, etc.)
model_cache.downloads: optional additional artefacts fetched from HuggingFace or ModelScope
pipeline: enables or disables Docling pipeline stages (table structure, code/formula enrichment, picture classification, VLM descriptions)

The default configuration ships with RapidOCR (GPU-enabled via ONNX Runtime). Switch to EasyOCR or Tesseract by updating ocr_config.backend and adjusting backend-specific sections.

Local fallback (optional)¶

If you prefer to run Docling inside the bot process, install the Python package manually and set the backend to local in config.yaml:

pip install docling

Verify installation¶

from src.processor.file_processor import FileProcessor

processor = FileProcessor()
if processor.is_available():
    print("Docling available!")
    print(f"Supported formats: {processor.get_supported_formats()}")
else:
    print("Docling not available")

Examples¶

PDF document¶

1. Send a PDF file
2. Bot replies: "🔄 Processing message..."
3. After processing you get details:
   ✅ Saved successfully!
   📁 File: research-paper-2024-10-04.md
   📂 Category: science/research
   🏷 Tags: pdf, research, ai

Image with text¶

1. Send an image (screenshot or document photo)
2. Docling extracts text from the image
3. The text is analyzed and saved to the KB

Multiple files¶

1. Send multiple files in a row
2. Bot groups messages that arrive within the grouping timeout (30 seconds by default)
3. All files are processed and merged into one note

Error handling¶

Unsupported format¶

The bot still tries to extract text
Processes the rest of the message content
Does not abort processing

File processing error¶

Error is logged
User can be notified (optional)
Other content is still processed

Temporary files¶

Temporary directories are created
Files are cleaned up after processing
Exceptions on cleanup are handled
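
The behaviour described in this section (log the error, keep processing the remaining content, always clean up) can be sketched as follows. `process_all` and `demo_extract` are hypothetical helpers and not part of tg-note; the sketch relies on `tempfile.TemporaryDirectory`, which removes the directory when the `with` block exits.

```python
import logging
import tempfile
from pathlib import Path
from typing import Callable

logger = logging.getLogger("file_processing_sketch")


def process_all(
    files: dict[str, bytes],
    extract: Callable[[Path], str],
) -> dict[str, str]:
    """Run `extract` on each file; log failures and keep going."""
    results: dict[str, str] = {}
    # The directory and its contents are removed on exit from the block,
    # even when an exception propagates out of it.
    with tempfile.TemporaryDirectory(prefix="tg_note_sketch_") as tmp:
        for name, data in files.items():
            path = Path(tmp) / name
            path.write_bytes(data)
            try:
                results[name] = extract(path)
            except Exception:
                # One bad file must not abort the rest of the message.
                logger.exception("Failed to process %s", name)
    return results


def demo_extract(path: Path) -> str:
    """Stand-in extractor: accepts .txt and rejects everything else."""
    if path.suffix != ".txt":
        raise ValueError(f"unsupported format: {path.suffix}")
    return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    out = process_all({"a.txt": b"hello", "b.xyz": b"\x00"}, demo_extract)
    print(out)
```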

Settings¶

File format recognition works out of the box. You can customize whether it's enabled and which formats are processed:

Enabling/Disabling media processing¶

You can completely enable or disable media file processing using the master switch:

# config.yaml

# Enable media processing (default)
MEDIA_PROCESSING_ENABLED: true

# To disable all media processing, use this value instead:
# MEDIA_PROCESSING_ENABLED: false

When MEDIA_PROCESSING_ENABLED is set to false, all file processing is disabled regardless of other settings.

Docling configuration¶

Docling behaviour is configured via the MEDIA_PROCESSING_DOCLING block in config.yaml. Key options:

enabled: Docling master switch (in addition to MEDIA_PROCESSING_ENABLED)
backend: "mcp" (default) or "local" to force the in-process DocumentConverter
formats: list of allowed file extensions (lowercase, no dots)
max_file_size_mb: per-file size limit (set to 0 to disable the limit)
prefer_markdown_output: prefer Markdown export when possible
fallback_plain_text: automatically fall back to plain-text export when the preferred export fails
image_ocr_enabled: allow OCR for image formats (jpg, jpeg, png, tiff)
ocr_languages: OCR language hints (ISO 639-3 codes such as eng, deu)
mcp: nested MCP configuration (server_name, url, tool_name, auto_detect_tool, etc.)

# config.yaml

MEDIA_PROCESSING_DOCLING:
  enabled: true
  backend: mcp
  max_file_size_mb: 25
  prefer_markdown_output: true
  fallback_plain_text: true
  image_ocr_enabled: true
  ocr_languages:
    - eng
  formats:
    - pdf
    - docx
    - pptx
    - xlsx
    - html
    - md
    - txt
    - jpg
    - jpeg
    - png
    - tiff
  mcp:
    server_name: docling
    transport: sse
    url: http://docling-mcp:8077/sse   # Override for custom deployments
    tool_name: convert_document_from_content
    auto_detect_tool: true
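
How these options might combine into a single accept/reject decision is sketched below. `should_process` is a hypothetical helper, not tg-note's code. The defaults used when a key is missing and the choice of 1 MB = 1024 * 1024 bytes are assumptions of this sketch.

```python
from pathlib import Path

IMAGE_FORMATS = {"jpg", "jpeg", "png", "tiff"}


def should_process(filename: str, size_bytes: int, docling_cfg: dict) -> bool:
    """Decide whether a file passes the Docling settings."""
    if not docling_cfg.get("enabled", True):
        return False

    # Formats are stored lowercase and without the leading dot.
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in docling_cfg.get("formats", []):
        return False

    # Images additionally require OCR to be allowed.
    if ext in IMAGE_FORMATS and not docling_cfg.get("image_ocr_enabled", True):
        return False

    # A limit of 0 disables the size check.
    limit_mb = docling_cfg.get("max_file_size_mb", 0)
    return limit_mb <= 0 or size_bytes <= limit_mb * 1024 * 1024


if __name__ == "__main__":
    cfg = {"enabled": True, "max_file_size_mb": 25,
           "image_ocr_enabled": False, "formats": ["pdf", "png"]}
    print(should_process("Report.PDF", 1_000_000, cfg))       # pdf, under the limit
    print(should_process("scan.png", 1_000_000, cfg))         # image, OCR disabled
    print(should_process("big.pdf", 30 * 1024 * 1024, cfg))   # over 25 MB
```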

💡 Base64 Transfer: The default tool convert_document_from_content accepts base64-encoded payloads, allowing tg-note to send documents directly to the Docling MCP container without relying on shared filesystem paths. This is the recommended approach for Docker deployments. The classic convert_document tool remains available for path-based workflows but requires shared volumes.

Original Package: The integration uses the official docling-mcp==1.3.2 package with minimal wrappers:

Original MCP server entrypoint (docling_mcp.servers.mcp_server.main)
Original conversion tools and caching
Additional convert_document_from_content tool for base64 transfer
Custom converter configuration for OCR settings

Enabling/Disabling specific formats¶

Customise the formats list (and optionally image_ocr_enabled) to restrict Docling. The snippets below are alternatives; keep only one MEDIA_PROCESSING_DOCLING block in config.yaml:

# Enable only documents (no images)
MEDIA_PROCESSING_DOCLING:
  enabled: true
  image_ocr_enabled: false
  formats:
    - pdf
    - docx
    - pptx
    - xlsx

# Enable only specific formats
MEDIA_PROCESSING_DOCLING:
  formats:
    - pdf
    - jpg
    - png

# Disable Docling entirely
MEDIA_PROCESSING_DOCLING:
  enabled: false

Legacy note: The older MEDIA_PROCESSING_DOCLING_FORMATS list is still supported for backward compatibility, but new deployments should migrate to the richer MEDIA_PROCESSING_DOCLING configuration block.

Message grouping timeout¶

# config.yaml
MESSAGE_GROUP_TIMEOUT: 30  # seconds
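
One way to picture the grouping timeout: messages whose arrival times are close enough together end up in the same group. The sketch below assumes the timer restarts with every new message (gap-based grouping). Whether tg-note measures from the first or the last message of a group is not stated here, so treat that rule and the `group_by_timeout` helper as illustrative.

```python
def group_by_timeout(timestamps: list[float], timeout: float = 30.0) -> list[list[float]]:
    """Split arrival times (in seconds) into groups separated by gaps > timeout."""
    groups: list[list[float]] = []
    for ts in sorted(timestamps):
        # Join the current group when the gap to its last message is small enough.
        if groups and ts - groups[-1][-1] <= timeout:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


if __name__ == "__main__":
    # Three files sent in quick succession, then one much later.
    print(group_by_timeout([0, 10, 35, 80]))
```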

Media processing configuration structure¶

The media processing configuration has a hierarchical structure:

# Master switch - controls all media processing
MEDIA_PROCESSING_ENABLED: true

# Per-framework format configuration (new structure)
MEDIA_PROCESSING_DOCLING:
  enabled: true
  formats:
    - pdf
    - jpg
    - ...

# Future frameworks can be added
# MEDIA_PROCESSING_SOME_OTHER_FRAMEWORK_FORMATS:
#   - mp3
#   - mp4
#   - ...

# (Deprecated) Older list-only syntax is still recognised:
# MEDIA_PROCESSING_DOCLING_FORMATS:
#   - pdf
#   - jpg
#   - ...
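
The hierarchy can be read as a short resolution procedure over the parsed configuration. The sketch below is an illustration, not the project's loader: `effective_formats` is a hypothetical name, missing switches are assumed to default to enabled, and the deprecated list is assumed to be consulted only when the new block does not define formats.

```python
def effective_formats(config: dict) -> list[str]:
    """Resolve which extensions Docling should handle for a parsed config."""
    # Master switch: when off, nothing is processed regardless of other keys.
    if not config.get("MEDIA_PROCESSING_ENABLED", True):
        return []

    block = config.get("MEDIA_PROCESSING_DOCLING")
    if isinstance(block, dict):
        if not block.get("enabled", True):
            return []
        if "formats" in block:
            return [f.lower() for f in block["formats"]]

    # Assumed fallback: the deprecated list-only key is used only when
    # the new block does not define formats.
    legacy = config.get("MEDIA_PROCESSING_DOCLING_FORMATS", [])
    return [f.lower() for f in legacy]


if __name__ == "__main__":
    print(effective_formats({
        "MEDIA_PROCESSING_ENABLED": True,
        "MEDIA_PROCESSING_DOCLING": {"enabled": True, "formats": ["pdf", "jpg"]},
    }))
```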

Configuration examples¶

# Example 1: Completely disable media processing
MEDIA_PROCESSING_ENABLED: false
# When disabled, format lists are ignored

# Example 2: Enable only PDF processing
MEDIA_PROCESSING_ENABLED: true
MEDIA_PROCESSING_DOCLING:
  formats:
    - pdf

# Example 3: Enable documents but not images
MEDIA_PROCESSING_ENABLED: true
MEDIA_PROCESSING_DOCLING:
  image_ocr_enabled: false
  formats:
    - pdf
    - docx
    - pptx
    - xlsx

# Example 4: Enable only images (OCR)
MEDIA_PROCESSING_ENABLED: true
MEDIA_PROCESSING_DOCLING:
  formats:
    - jpg
    - jpeg
    - png
    - tiff

Advanced usage¶

Programmatic access¶

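import asyncio
from pathlib import Path

from src.processor.file_processor import FileProcessor


async def process_my_file():
    processor = FileProcessor()

    # Stop early when Docling is not available
    if not processor.is_available():
        print("Docling not available")
        return

    result = await processor.process_file(Path("my_document.pdf"))

    # Guard against an empty result before reading its fields
    if result:
        print(f"Extracted {len(result['text'])} chars")
        print(f"Metadata: {result['metadata']}")
        print(f"Text: {result['text'][:100]}...")


# process_file is a coroutine, so it has to run inside an event loop
asyncio.run(process_my_file())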

Agent integration¶

# in handlers.py (BotHandlers), calling the method added in content_parser.py
content = await self.content_parser.parse_group_with_files(group, bot=self.bot)

# content['text'] contains:
# - Message text
# - Extracted file content
# - File metadata

Performance¶

Optimization¶

Async IO for file operations
Temporary files cleaned automatically
Sequential but efficient handling of multiple files

Telegram limits¶

Max file size: 20 MB (Bot API download limit for bots)
Files hosted by Telegram temporarily
Download speed depends on network

Debugging¶

Enable detailed logging:

# config.yaml
LOG_LEVEL: DEBUG

You will see:

File download progress
Docling results
Errors and warnings
Cleanup operations

Check Docling¶

import logging
logging.basicConfig(level=logging.DEBUG)
from src.processor.file_processor import FileProcessor

processor = FileProcessor()
print(f"Docling available: {processor.is_available()}")
print(f"Supported: {processor.get_supported_formats()}")

Known issues¶

Large files (>10 MB) may take noticeably longer to process
Low-quality images reduce OCR quality
Complex PDF layout may require extra processing

Support¶

Docs: https://artyomzemlyak.github.io/tg-note/
Issues: https://github.com/ArtyomZemlyak/tg-note/issues
Discussions: https://github.com/ArtyomZemlyak/tg-note/discussions

Roadmap¶

✅ Basic file support
🚧 Audio/video files
📋 Better table extraction
📋 Archive support (.zip, .tar.gz)
📋 Batch processing
📋 Caching