Bridging the gap between unstructured data and actionable insights through interactive, agent-driven pipelines.
- DocETL Overview: A specialized tool designed for creating and executing data processing pipelines, specifically optimized for complex document processing tasks using AI agents.
- Interactive Development: Features DocWrangler, an interactive UI playground that allows users to experiment with prompts, build pipelines step-by-step, and see real-time results before exporting for production.
- Production Ready: Offers a robust Python package for running finalized pipelines, capable of handling sophisticated tasks like analyzing medical transcripts and resolving entity ambiguity.
The landscape of data engineering is shifting. As organizations accumulate large volumes of unstructured data, from legal contracts to medical records, traditional Extract, Transform, Load (ETL) tools often fall short because they struggle with the nuance required to “read” and “understand” text. Enter DocETL, a Python library that brings agentic data processing to the ETL workflow. By using AI to parse, analyze, and structure complex documents, DocETL turns static files into structured, queryable data.
At its core, DocETL is a comprehensive tool for creating and executing data processing pipelines. Unlike standard ETL tools that focus on moving rows and columns between databases, DocETL is engineered for complex document processing tasks. It treats data processing not just as a mechanical transfer, but as an intellectual task requiring comprehension.
The ecosystem is built around two main components designed to support the full lifecycle of a data pipeline, from the initial spark of an idea to full-scale deployment:
- DocWrangler: An interactive UI for experimentation.
- The DocETL Python Package: A robust engine for production execution.
DocWrangler: The Interactive Playground
Developing AI pipelines can often feel like a “black box” experience: you write a prompt, run a script, and hope for the best. DocWrangler reduces this uncertainty. It is an interactive UI playground recommended for the development phase of your project.
DocWrangler lets data engineers develop their pipelines iteratively with immediate feedback. Users can experiment with different prompts and watch the results appear in real time, allowing for rapid tuning of the AI’s instructions. You can build your pipeline step by step, checking that each operation behaves as intended before moving to the next. Once the workflow is ready, DocWrangler lets you export the finalized pipeline configuration for production use.
Getting started with DocWrangler is flexible. It is hosted publicly at docetl.org/playground, but for those requiring data privacy or offline capabilities, it can be run locally. The recommended method for a quick local start is using Docker (make docker), though a manual development environment setup is also available for granular control.
The Python Package: Powering Production
Once a pipeline is designed and tested in DocWrangler, the DocETL Python package takes over. This is the engine used to run your production pipelines at scale. It integrates directly into your existing Python infrastructure, requiring Python 3.10 or later and a valid OpenAI API key to function.
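The two prerequisites above can be verified before a pipeline run with a small preflight check. This is a minimal sketch, not part of DocETL itself: the function name check_prerequisites is hypothetical, and the environment variable name OPENAI_API_KEY is an assumption based on the convention used by OpenAI client libraries.

```python
import os
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the text above
API_KEY_VAR = "OPENAI_API_KEY"  # assumed variable name (OpenAI client convention)


def check_prerequisites(version_info=None, env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    # Fall back to the running interpreter and process environment by default.
    version_info = sys.version_info if version_info is None else version_info
    env = os.environ if env is None else env

    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found "
            f"{version_info[0]}.{version_info[1]}"
        )
    # Treat a blank or whitespace-only key the same as a missing one.
    if not env.get(API_KEY_VAR, "").strip():
        problems.append(f"{API_KEY_VAR} is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

Passing the version and environment as arguments keeps the check easy to test without modifying the real process environment.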
The power of this package is best illustrated through its ability to handle intricate, multi-step logic. Consider a real-world scenario involving healthcare data: A user can create a pipeline that ingests raw medical transcripts. The agentic pipeline then identifies specific medications mentioned in the text, resolves similar or ambiguous names (entity resolution), and finally generates concise summaries detailing side effects and therapeutic uses. This level of semantic understanding and data synthesis is what sets DocETL apart from traditional text parsers.
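The shape of that workflow (extract, resolve, summarize) can be sketched in plain Python. The example below is only an illustration of the data flow and does not use the DocETL API: every function name is hypothetical, and the keyword list and alias table are toy stand-ins for the LLM calls a real pipeline would make at each step.

```python
from collections import defaultdict

# Toy stand-ins for LLM behavior; a real pipeline would prompt a model instead.
KNOWN_MEDICATIONS = {"ibuprofen", "advil", "metformin"}
ALIASES = {"advil": "ibuprofen"}  # replaces LLM-based entity resolution


def extract_medications(transcript):
    """Map step: list medication mentions found in one transcript."""
    words = [w.strip(".,;:").lower() for w in transcript.split()]
    return [w for w in words if w in KNOWN_MEDICATIONS]


def resolve_name(name):
    """Resolve step: collapse variants of a name to one canonical form."""
    return ALIASES.get(name, name)


def summarize(medication, transcripts):
    """Reduce step: one record per medication (placeholder for an LLM summary)."""
    return {"medication": medication, "mention_count": len(transcripts)}


def run_pipeline(transcripts):
    """Chain the three steps: extract mentions, resolve names, then summarize."""
    grouped = defaultdict(list)
    for text in transcripts:
        for mention in extract_medications(text):
            grouped[resolve_name(mention)].append(text)
    return [summarize(med, texts) for med, texts in sorted(grouped.items())]


if __name__ == "__main__":
    sample = [
        "Patient takes Advil daily.",
        "Ibuprofen caused nausea; started metformin.",
    ]
    for record in run_pipeline(sample):
        print(record)
```

In this sketch "Advil" and "Ibuprofen" end up in the same group, which is the effect entity resolution is meant to achieve. In an agentic pipeline, each of the three helper functions would be replaced by a prompted model call rather than a lookup.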
A New Standard for Unstructured Data
DocETL represents a significant step forward in making unstructured data accessible. By combining an interactive, visual development environment with a code-first production engine, it lowers the barrier to building sophisticated AI data pipelines. Whether you are analyzing financial reports, legal discovery documents, or clinical notes, DocETL provides a framework for turning text into structured, usable data.

