Apache Flink is an open-source, low-latency, high-throughput, distributed stream-processing framework. It is developed by the Apache Flink community as part of the Apache Software Foundation, is written in Java and Scala, and was initially released in 2011.
Apache Flink can process streaming data in real time, in batches, or both, which also makes it well suited to powering event-driven applications.
You would typically use Apache Flink when you need to process millions of records per minute, whether in a single batch or in real time. Along with this high throughput, Apache Flink also guarantees correctness and fault tolerance.
The main features of Apache Flink are:
- A streaming processor that handles both batch and real-time processing
- High throughput and very low latency
- Fault tolerance: if anything goes wrong, processing can be resumed after a failure
- API components for common tasks
- Wide ecosystem compatibility, including Apache Hadoop, Hadoop MapReduce, HBase, Apache Spark, and more
- Libraries for common tasks, including machine learning, graph processing, complex event processing, and more
- Highly scalable: a cluster can grow to a large number of nodes
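Flink's fault tolerance rests on periodic checkpoints of operator state: on failure, the latest checkpoint is restored and input is replayed from the saved position, so each record counts exactly once. The following is a minimal plain-Python sketch of that idea, not Flink code; the names and the simulated "CRASH" event are invented for the example.

```python
# Conceptual sketch of checkpoint-based recovery (plain Python, not Flink).
# An operator keeps a running sum; every few records it snapshots its state
# together with the input offset. After a crash it restores the snapshot
# and replays the input from the saved offset, so no record is counted twice.

def process(events, checkpoint_every=3):
    """Sum the numeric events, surviving a simulated transient crash."""
    state = 0                  # operator state (running sum)
    checkpoint = (0, 0)        # last durable snapshot: (state, input offset)
    i = 0
    while i < len(events):
        if events[i] == "CRASH":                          # simulated failure
            state, i = checkpoint                         # restore state, rewind input
            events = [e for e in events if e != "CRASH"]  # failure is transient
            continue
        state += events[i]
        i += 1
        if i % checkpoint_every == 0:                     # periodic checkpoint
            checkpoint = (state, i)
    return state

print(process([1, 2, 3, 4, "CRASH", 5]))   # 15, the same as without the crash
```

Note that the record processed between the last checkpoint and the crash is reprocessed on recovery; because the state was rolled back to the checkpoint, the final result is still correct.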
With the above features in mind, some of the common use cases are:
- Event-driven Applications
- Data Analytics
- Data Pipelines
Here is a diagram of a traditional application compared to an event-driven application:
- Traditional application:
- Event-driven application:
Apache Flink provides a wide range of functionality. The following sections provide a brief overview of the different components of Apache Flink.
There are multiple storage backends available for Apache Flink. Here are some of the most popular ones:
- HDFS - Apache Hadoop Distributed File System
- Relational Databases like MySQL, PostgreSQL, Oracle, etc.
- Local File System
- Apache Kafka and Apache Flume
- NoSQL Databases like MongoDB
Flink can be deployed on a variety of platforms. Here are some of the most popular ones:
- You can deploy Apache Flink on a single machine.
- You can deploy Apache Flink on a cluster of machines using Apache Mesos or Apache YARN.
- You can deploy Apache Flink on a cloud platform like Amazon Web Services (AWS) and Google Cloud Platform (GCP).
Flink's layered APIs are arguably its most important feature. There are several APIs that you can use to interact with Apache Flink.
High-level analytics API: SQL & Table API. The Table API is a relational API that lets you query data with SQL-like syntax. Both SQL and the Table API can be used for batch and streaming processing.
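Flink SQL follows standard SQL, so the same shape of query can run once over a batch table or continuously over a stream. As an illustration of the query shape only, the aggregation below is executed with Python's built-in sqlite3 rather than Flink; the table and column names are invented for the example.

```python
import sqlite3

# A GROUP BY aggregation of the kind you would express with Flink SQL or
# the Table API; here it runs once over a small static table via sqlite3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_name TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("alice", "/a"), ("bob", "/b"), ("alice", "/c")],
)

rows = conn.execute(
    "SELECT user_name, COUNT(*) AS cnt "
    "FROM clicks GROUP BY user_name ORDER BY user_name"
).fetchall()
print(rows)   # [('alice', 2), ('bob', 1)]
```

In Flink, the same statement over an unbounded source would emit updated counts as new clicks arrive, instead of a single final result.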
Stream and batch processing: the DataStream API, which is used for handling streaming data and lets you filter, map, and aggregate the data.
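The transformations named above can be mimicked on a plain Python iterable to show the shape of such a pipeline. This is a conceptual sketch, not the Flink DataStream API; the data and scaling factor are invented for the example.

```python
from itertools import accumulate

# filter -> map -> aggregate over a finite "stream" of sensor readings
readings = [3, -1, 4, -2, 5]

valid = filter(lambda x: x >= 0, readings)    # drop invalid readings
scaled = map(lambda x: x * 10, valid)         # convert units
running_total = list(accumulate(scaled))      # rolling aggregate

print(running_total)   # [30, 70, 120]
```

In Flink, the same chain would be expressed as operators on an unbounded `DataStream`, with the aggregate maintained as managed state rather than materialized into a list.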
You can run Apache Flink with cluster resource managers such as Apache Mesos, Apache YARN, or Kubernetes, but you can also run Flink as a standalone cluster.
The Flink cluster consists of two types of nodes:
- JobManager: The JobManager is responsible for managing the execution of jobs.
- TaskManager: The TaskManager is responsible for executing the tasks.
Let’s take a look at the components of the JobManager and TaskManager.
The JobManager has several components that are responsible for managing the execution of jobs.
- ResourceManager: The ResourceManager is responsible for managing the resources of the cluster.
- Dispatcher: The Dispatcher provides a REST interface for submitting jobs and starts a new JobMaster for each submitted job.
- JobMaster: The JobMaster is responsible for managing the execution of a single job.
There is always at least one JobManager in the cluster. For high availability, you can run multiple JobManagers, one of which is the leader while the others stand by.
The TaskManager, also called the worker, is responsible for executing the tasks.
As with the JobManager, there is always at least one TaskManager in the cluster.
Task slots are the smallest unit of resource scheduling in a TaskManager: the number of slots determines how many tasks the TaskManager can run concurrently. A TaskManager can have multiple task slots.
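The relationship between TaskManagers, slots, and tasks can be sketched as a simple round-robin assignment. This is a deliberate simplification: Flink's actual scheduler also considers slot sharing and locality, and the TaskManager names and task list below are invented for the example.

```python
# Assign tasks to the available slots of TaskManagers, round-robin.
task_managers = {"tm-1": 2, "tm-2": 2}   # TaskManager name -> number of task slots

# Flatten the cluster into a list of individually addressable slots.
slots = [f"{tm}/slot-{s}" for tm, n in task_managers.items() for s in range(n)]

tasks = ["source", "map", "keyBy", "sink"]
assignment = {task: slots[i % len(slots)] for i, task in enumerate(tasks)}
print(assignment)
# {'source': 'tm-1/slot-0', 'map': 'tm-1/slot-1',
#  'keyBy': 'tm-2/slot-0', 'sink': 'tm-2/slot-1'}
```

With four slots and four tasks, every slot is occupied; a fifth task would wrap around and share a TaskManager with an earlier one.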
Flink in Depth
Go deeper into Flink via the following articles.
Below is a curated list of high-quality external resources on the topic of Flink.