
Apache Hop Overview


Apache Hop is the open-source standard for visual data orchestration and pipeline development. Learn how it works, when to use it, and how to run it in production with Putki.





Apache Hop is an open-source data orchestration platform built for teams who need to move, transform, and manage data across business systems reliably and repeatably, without being locked into a single infrastructure. It is a top-level Apache Software Foundation project and the modern successor to Pentaho Data Integration (PDI/Kettle).



What is Apache Hop?

Apache Hop is a powerful, open-source data orchestration platform designed to help teams visually author, run, and monitor data pipelines and workflows. It's used for building complex integrations that require data transformation, workflow orchestration, and reliable execution across a wide range of systems, from relational databases and REST APIs to ERP platforms and cloud services.

At the heart of Apache Hop are two core concepts: pipelines and workflows. Pipelines perform the core data processing tasks: extracting, transforming, and loading data between systems. Workflows handle orchestration-level logic: running pipelines in sequence, managing errors, moving files, sending notifications, and coordinating execution across environments. Together, they provide a complete runtime for any data integration scenario.

Unlike code-first orchestration tools, Apache Hop uses a visual, metadata-driven approach. Pipelines and workflows are built using a drag-and-drop IDE, making complex logic accessible without requiring deep programming expertise, while still offering the power and flexibility that experienced engineers expect.


History of Apache Hop

If you've worked in data integration for more than a few years, you've probably encountered Pentaho Data Integration, also known as Kettle. For over a decade, PDI was one of the most widely used ETL tools in the world. It was powerful, visual, and approachable. It was also showing its age.

In 2019, a group of engineers, several of whom had spent years building and contributing to PDI, decided to start over. Not from scratch, but with intention. The result was Apache Hop: a complete architectural rethink that kept what made PDI great and replaced everything that held it back. The project entered the Apache Software Foundation incubator in 2020 and graduated as a top-level project in 2021.

The rebuild wasn't cosmetic. Apache Hop introduced a fully plugin-based engine, a redesigned IDE, a clean separation between pipeline logic and environment configuration, and native support for multiple execution runtimes. For teams migrating from PDI, the transition is familiar. For teams starting fresh, there's no legacy baggage to navigate.


What is Apache Hop used for?

Apache Hop is the tool of choice for data integration teams that need to connect, transform, and move data reliably across business systems. Data engineers use it to build pipelines that run on any infrastructure (locally, in Docker, on Kubernetes, or in the cloud) without modifying the pipeline logic itself.

Apache Hop's key features include:

  • Visual pipeline development: Build data pipelines using a rich, drag-and-drop IDE without writing boilerplate code
  • Workflow orchestration: Coordinate complex sequences of pipelines, scripts, and system operations with full error handling
  • Environment abstraction: Define pipelines once and run them across dev, test, and production environments using configuration variables
  • Runtime flexibility: Execute on local JVM, Docker, Kubernetes, Apache Spark, Apache Flink, or Apache Beam, all from the same pipeline definition
  • Extensibility: A fully plugin-based architecture means virtually every component can be extended or replaced

These features make Apache Hop particularly well suited for:

  • ERP and CRM integrations: Connecting systems like Odoo, Salesforce, and HubSpot to data warehouses or reporting layers
  • ETL and ELT pipelines: Extracting, transforming, and loading data between relational databases, files, and APIs
  • Data migration projects: Moving data between legacy and modern systems with full transformation control
  • Operational data workflows: Automating file transfers, database maintenance, and system synchronisation tasks



Key Features of Apache Hop

1. Visual, Metadata-Driven Development

Apache Hop's IDE provides a visual canvas for building pipelines and workflows. All pipeline logic is stored as metadata, not code, making it easy to version, share, and deploy across environments. Engineers who prefer code can work directly with the underlying XML metadata or integrate with standard Git workflows.
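To illustrate what "metadata, not code" means in practice: a pipeline saved from the IDE is a plain XML file (typically with an .hpl extension). The sketch below is deliberately simplified, and the transform names and types shown are illustrative rather than the exact Hop schema:

```xml
<!-- Simplified sketch of a Hop pipeline file (.hpl).
     Element names follow the general shape of Hop metadata;
     the transform types shown here are illustrative. -->
<pipeline>
  <info>
    <name>load_orders</name>
  </info>
  <transform>
    <name>Read orders CSV</name>
    <type>TextFileInput</type>
  </transform>
  <transform>
    <name>Write to warehouse</name>
    <type>TableOutput</type>
  </transform>
</pipeline>
```

Because the file is plain text, a Git diff shows exactly which transforms were added or changed between versions.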

2. Architecture-Agnostic Execution

One of Apache Hop's most distinctive capabilities is its runtime flexibility. A pipeline built in the Hop IDE can be executed locally for development, containerised with Docker for testing, deployed to Kubernetes for production, or submitted to Apache Spark or Flink for large-scale data processing, all without changing a single transform.
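In practice, this runtime flexibility is exercised through run configurations passed to Hop's hop-run command line tool. The sketch below assumes a local Hop installation; the project, pipeline, and run configuration names are hypothetical, and the run configurations themselves would be defined as Hop metadata:

```shell
# Run the same pipeline definition under different run configurations.
# Names below (my-project, load_orders.hpl, local, spark-prod) are
# illustrative, not part of a default Hop installation.

# Local development run
./hop-run.sh --project my-project \
             --file pipelines/load_orders.hpl \
             --runconfig local

# Same pipeline, submitted to a Spark-based engine by switching
# only the run configuration; the pipeline file is untouched
./hop-run.sh --project my-project \
             --file pipelines/load_orders.hpl \
             --runconfig spark-prod
```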

3. Pipelines and Workflows

Apache Hop makes a clear architectural distinction between pipelines and workflows. Pipelines stream data through a series of transforms in parallel, optimised for throughput. Workflows orchestrate sequences of actions (running pipelines, executing scripts, copying files, sending notifications) with conditional logic and error handling built in.

4. Plugin-Based Extensibility

Every component in Apache Hop is a plugin: transforms, actions, database connections, execution engines, and log channels. This makes it straightforward to extend the platform for specific use cases: adding custom transforms, integrating with proprietary APIs, or building entirely new execution runtimes.

5. Environment and Configuration Management

Apache Hop separates pipeline logic from environment configuration using a system of projects and environments. The same pipeline runs in development, testing, and production simply by switching the active environment, with no pipeline modifications required. This makes CI/CD integration natural and reduces configuration drift across environments.
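The project/environment split is itself stored as plain configuration. A hedged sketch of an environment variables file (the structure is simplified and the variable names are hypothetical):

```json
{
  "variables": [
    { "name": "DB_HOST",   "value": "prod-db.internal", "description": "Warehouse host" },
    { "name": "DB_SCHEMA", "value": "analytics",        "description": "Target schema" },
    { "name": "INPUT_DIR", "value": "/data/incoming",   "description": "Landing directory" }
  ]
}
```

Pipelines reference these values as variables (for example ${DB_HOST}); switching from dev to production swaps the configuration file, not the pipeline.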

6. Native Git Integration

Apache Hop stores all metadata as structured files, making it natively compatible with Git and any version control system. Teams can apply standard software engineering practices (branching, pull requests, code review, and automated deployment) to their data pipeline development.
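Because a pipeline is just a file, the Git workflow is the ordinary one. A minimal sketch, where the file name, its placeholder contents, and the commit message are hypothetical:

```shell
# Version a Hop pipeline file like any other source file.
mkdir hop-project && cd hop-project
git init -q
printf '<pipeline/>\n' > load_orders.hpl   # stand-in for a real .hpl file
git add load_orders.hpl
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Add load_orders pipeline"
git log --oneline
```

From here, branching, pull requests, and automated deployment of pipeline files work exactly as they do for application code.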

Why Apache Hop is the Future of Visual Data Orchestration

Apache Hop's combination of visual development, architectural flexibility, and a modern plugin-based engine makes it one of the most capable open-source data integration platforms available. Its ability to run the same pipeline across local, containerised, and distributed environments without modification is a capability few tools can match.

For teams coming from PDI/Kettle, the transition is natural. For teams evaluating modern data integration tooling from scratch, Apache Hop offers the rare combination of a low barrier to entry and genuine depth for complex use cases.

Putki: Apache Hop for the Enterprise

Putki is know.bi's production-ready distribution of Apache Hop, closing the gap between the power of the open-source engine and the operational requirements of running it in a business environment.

Running Apache Hop in production requires more than a download. Teams need hardened builds they can trust, visibility into what their pipelines are actually doing, governance tooling to manage growing complexity, and someone to call when something breaks. Putki provides all of that, built and maintained by the same team that contributes to the Apache Hop core.

Putki adds the following on top of Apache Hop:

  • Security: Hardened Docker images, vulnerability scanning with every release, and patch releases for critical issues between Apache Hop major versions
  • Observability: Centralized execution logs, pipeline health dashboards, Grafana monitoring, and Slack alerting
  • Governance: Autodoc for automated technical documentation, SQL Parser for dependency extraction, and RDBMS Impact Analysis to understand the downstream effect of schema changes
  • Connectivity: 15+ maintained connectors for business systems including Odoo, Shopify, and HubSpot
  • Support: SLA-backed service desk, a 70+ article knowledge base, and advisory services from the team that built the platform

Getting Started with Apache Hop

The fastest way to get started with Apache Hop is to download the Hop IDE and follow the official documentation at hop.apache.org. The IDE runs on Windows, Linux, and macOS and requires only a supported Java runtime.

For teams that need a production-ready starting point, with security hardening, operational tooling, and commercial support already in place, Putki provides a distribution of Apache Hop that is ready to run from day one.

Why Putki is Ideal for Apache Hop in Production

Putki is know.bi's enterprise distribution of Apache Hop. know.bi's founders are active contributors to the Apache Hop project; they've been involved since before it had that name. Putki exists because they spent years watching capable teams struggle not with Apache Hop itself, but with everything around it.

Putki doesn't replace Apache Hop. It takes the same engine and wraps it in the operational layer that production environments require:

  • Security and hardening: Every Putki release is scanned for vulnerabilities using Trivy and Docker Scout. Images are hardened, experimental plugins removed, attack surface reduced, secure defaults applied. Patch releases cover critical security issues between Apache Hop major versions. The current release carries no CVEs above score 9.
  • Observability: Centralized execution logs, pipeline health dashboards, Grafana monitoring, and Slack alerting. The difference between "the pipeline ran" and "the pipeline did what it was supposed to" is visibility, and that's what Putki adds.
  • Governance: Autodoc generates technical documentation directly from pipeline metadata. The SQL Parser extracts dependencies from scripts. RDBMS Impact Analysis shows which pipelines will break before you change a database schema, not after.
  • Connectivity: 15+ maintained connectors for business systems including Odoo, Shopify, and HubSpot, with built-in handling for API versioning, pagination, and rate limiting.
  • Support: A service desk with guaranteed response times, a structured knowledge base, and advisory services from people who understand the platform at a level that comes from having built it.

Putki is available as an annual subscription across Basic, Professional, and Enterprise tiers, priced on the level of observability, governance, and support required, not on data volume or pipeline count.