The AI Rocket Ship

intelligence.join(analytics, on=["smarts"], how="full_outer")


More notes on the Apache Spark 4.1 release

Contributors

  1. Ankita Hatibaruah, LinkedIn
  2. Pavithra Ananthakrishnan, LinkedIn
  3. Sree Bhavya Kanduri, LinkedIn

Apache Spark 4.1 — Introduction

Apache Spark 4.1, released in December 2025, is a major update in the Spark 4.x series that enhances performance, usability, and developer productivity. Spark continues to serve as a unified analytics engine for large-scale, distributed data processing. This release brings faster Python execution, improved SQL and streaming engines, better error handling, and lower-latency processing, making Spark 4.1 well-suited for both batch processing and real-time analytics.

Key Highlights in Spark 4.1

1. Spark Declarative Pipelines (SDP)

A new declarative framework where users define what datasets and queries should exist, and Spark manages how they execute.

Key capabilities:

Streaming tables and materialized views defined in SQL or Python
Automatic dependency resolution and execution ordering across datasets

Value:

Reduces orchestration complexity and boilerplate, enabling reliable, production-grade pipelines with minimal effort.

For a detailed overview of SDP, refer to the linked document: Spark Declarative Pipeline
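As a sketch only: the Python flavor of SDP is decorator-based, but the module path (`pyspark.pipelines`), decorator name, and dataset names below are illustrative assumptions, not verified 4.1 API.

```python
# Sketch of a declarative pipeline definition file. The module path and
# decorator name are assumptions for illustration; the dataset names
# (raw_events, daily_event_counts) are hypothetical.
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.getActiveSession()  # session is provided by the pipeline runtime

@dp.materialized_view
def daily_event_counts() -> DataFrame:
    # The user only declares what this dataset is; Spark infers the
    # dependency on raw_events and decides when and in what order to
    # materialize each declared dataset.
    return (spark.read.table("raw_events")
            .groupBy("event_date")
            .count())
```

The point of the example is the shape: no orchestration code, just dataset definitions from which Spark derives the execution plan.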

2. Structured Streaming – Real-Time Mode (RTM)

First official support for real-time Structured Streaming with sub-second latency.

Why it matters:

Enables near-instant data processing for use cases like fraud detection, monitoring, and alerting. Spark 4.1 establishes the foundation for broader RTM support in future releases.
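A hedged sketch of what such a query might look like: the broker address, topic name, and especially the spelling of the real-time trigger option are assumptions for illustration; check the 4.1 API docs for the exact trigger API.

```python
# Sketch of a low-latency streaming query. The Kafka broker and topic are
# hypothetical, and the real-time trigger keyword is an assumed spelling.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rtm-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "payments")                   # hypothetical topic
          .load())

alerts = events.selectExpr("CAST(value AS STRING) AS payload")

query = (alerts.writeStream
         .format("console")
         .trigger(realTime="1 second")  # assumed name of the RTM trigger option
         .start())
```

Structurally this is an ordinary Structured Streaming job; RTM changes the trigger so records are processed as they arrive rather than in micro-batches.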

3. PySpark UDFs & Python Data Sources

Significant performance and observability improvements for Python workloads.

Enhancements include:

Arrow-native UDF & UDTF decorators
Python Data Source filter pushdown
Python worker logging

Outcome:

Faster execution, lower memory usage, and better debugging for PySpark applications.

4. Spark Connect Improvements

Spark Connect continues to mature as the default client-server architecture.

What’s new:

Spark ML on Connect is GA for Python
Better stability for large workloads

Benefit:

More reliable and scalable remote execution for notebooks, services, and multi-language clients.
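For illustration, this is what the thin-client connection looks like; `sc://` is the Spark Connect URI scheme and 15002 its default port, but the hostname here is hypothetical.

```python
# Connect a thin client to a remote Spark Connect server. The endpoint
# below is hypothetical; all query execution happens server-side.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .remote("sc://spark-connect.example.com:15002")
         .getOrCreate())

df = spark.range(5).filter("id % 2 = 0")
# df.collect() would run the plan on the server and stream results
# back to the client over Arrow.
```

Because the client holds only an unresolved plan, the same code works unchanged from notebooks, services, or any other Connect client.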

5. SQL Enhancements

Spark SQL sees major usability and performance upgrades.

Highlights:

SQL Scripting (enabled by default)
VARIANT data type
Recursive CTE support
New approximate data sketches

Impact:

More expressive SQL, better support for semi-structured data, and advanced analytical capabilities.

