Streamlining with Apache Beam
LinkedIn achieved a 94% reduction in processing time by combining Batch and strEAM pipelines using Apache Beam. Beam uses the concept of "windowing" to divide both unbounded and bounded data sets into manageable pieces for processing. By separating a pipeline's definition from its execution, Beam lets developers write pipelines that run on multiple distributed data processing platforms, whether batch or streaming. Last week, LinkedIn's engineering team shared their experience of using Beam to improve efficiency. Read the story below.
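To make the windowing idea concrete, here is a minimal plain-Python sketch of fixed (tumbling) windows. It does not use the Beam SDK; the function name `assign_fixed_windows` and the sample events are hypothetical, chosen only to illustrate how timestamped elements are bucketed into equal, non-overlapping intervals.

```python
from collections import defaultdict

def assign_fixed_windows(events, size_s):
    """Group (timestamp_seconds, value) pairs into fixed, non-overlapping windows.

    Each window is identified by its half-open interval [start, end).
    This is an illustration of the concept, not Beam's implementation.
    """
    windows = defaultdict(list)
    for ts, value in events:
        start = ts - (ts % size_s)  # round the timestamp down to a window boundary
        windows[(start, start + size_s)].append(value)
    return dict(windows)

# Hypothetical events: (event time in seconds, payload)
events = [(1, "a"), (59, "b"), (60, "c"), (125, "d")]
result = assign_fixed_windows(events, 60)
for window, values in sorted(result.items()):
    print(window, values)
```

The same bucketing applies whether the events come from a finite file or an endless stream, which is why windowing suits both bounded and unbounded data.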
How LinkedIn reduced processing time by 94% with Apache Beam
To support a use case that required a stream pipeline for real-time updates and a batch pipeline for periodic backfilling, LinkedIn had been using the Lambda architecture, which required developers to maintain two different codebases. To mitigate this, it decided to create a single codebase using Apache Beam. Now stream jobs are handled by Apache Samza, and batch jobs by Spark. One complexity, the differing IO behaviors of stream and batch processing, was abstracted away with PTransform operations. Result: resource requirements for the workload were cut by half, and processing time by 94%. The story includes useful implementation details and architecture.
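The single-codebase idea can be sketched in plain Python as one shared transformation fed by two different sources. This is only an analogy for the role a PTransform plays, not LinkedIn's code or the Beam API; `shared_logic`, `batch_source`, `stream_source`, and the record fields are all hypothetical.

```python
from typing import Iterable, Iterator

def shared_logic(records: Iterable[dict]) -> Iterator[dict]:
    """Business logic written once and reused by both pipelines."""
    for record in records:
        if record.get("title"):  # drop records without a title
            yield {"id": record["id"], "title": record["title"].strip().lower()}

def batch_source() -> list:
    """Stand-in for a bounded source, such as a periodic backfill read."""
    return [{"id": 1, "title": " Engineer "}, {"id": 2, "title": ""}]

def stream_source() -> Iterator[dict]:
    """Stand-in for an unbounded source; a generator yields records as they arrive."""
    yield {"id": 1, "title": " Engineer "}
    yield {"id": 2, "title": ""}

# Only the source differs; the transformation is identical.
batch_out = list(shared_logic(batch_source()))
stream_out = list(shared_logic(stream_source()))
print(batch_out == stream_out)
```

Keeping the IO differences inside the source functions means a change to the business logic is made in one place, which is the maintenance problem the Lambda architecture's two codebases created.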