The AI Rocket Ship
intelligence.join(analytics, on=["smarts"], how="full_outer")
More notes on the Apache Spark 4.1 release

Collaborators
- Ankita Hatibaruah, LinkedIn
- Pavithra Ananthakrishnan, LinkedIn
- Sree Bhavya Kanduri, LinkedIn
Apache Spark 4.1 — Introduction
Apache Spark 4.1, released in December 2025, is a major update in the Spark 4.x series that enhances performance, usability, and developer productivity. Spark continues to serve as a unified analytics engine for large-scale, distributed data processing.
This release brings faster Python execution, improved SQL and streaming engines, better error handling, and lower-latency processing, making Spark 4.1 well-suited for both batch processing and real-time analytics.
Key Highlights in Spark 4.1
1. Spark Declarative Pipelines (SDP)
A new declarative framework where users define what datasets and queries should exist, and Spark manages how they execute.
Key capabilities:
- Define datasets and transformations declaratively
- Automatic execution graph & dependency ordering
- Built-in parallelism, checkpoints, and retries
- Author pipelines in Python and SQL
- Compile and run pipelines via CLI
- Integrates with Spark Connect for multi-language clients
Value:
Reduces orchestration complexity and boilerplate, enabling reliable, production-grade pipelines with minimal effort.
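To make this concrete, below is a minimal sketch of a pipeline defined in Python. Treat the module path (pyspark.pipelines) and the materialized_view decorator as assumptions drawn from the 4.1 documentation, and verify them against your release:

```python
# Minimal SDP sketch. The module path and decorator names are assumptions
# based on the Spark 4.1 docs; verify against your release.
from pyspark import pipelines as dp
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# The pipeline runner is expected to provide an active session.
spark = SparkSession.getActiveSession()

@dp.materialized_view
def raw_orders():
    # A root dataset with no upstream dependencies.
    return spark.read.json("/data/orders/raw")

@dp.materialized_view
def valid_orders():
    # Reading raw_orders declares the dependency; Spark derives the
    # execution graph and run order from these references.
    return spark.read.table("raw_orders").where(col("amount") > 0)
```

Rather than running this file directly, you compile and launch the pipeline through the CLI (the release notes describe a spark-pipelines command for this).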
For a detailed overview of SDP, refer to the linked document: Spark Declarative Pipeline
2. Structured Streaming – Real-Time Mode (RTM)
First official support for real-time Structured Streaming with sub-second latency.
What’s supported in 4.1:
- Stateless, single-stage Scala queries
- Kafka sources
- Kafka and foreach sinks
- Continuous processing with single-digit millisecond latency for eligible workloads
Why it matters:
Enables near-instant data processing for use cases like fraud detection, monitoring, and alerting. Spark 4.1 establishes the foundation for broader RTM support in future releases.
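For a sense of what an RTM-eligible query looks like, here is a stateless, single-stage Kafka-to-Kafka pipeline. Real-time mode in 4.1 targets Scala queries, so take this PySpark sketch as an illustration of the query shape only; the trigger or configuration that actually enables RTM is not shown here:

```python
# Shape of an RTM-eligible query: stateless, single stage, Kafka in and out.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rtm-shape").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load())

# Stateless transformation: no aggregations, joins, or watermarks.
alerts = events.selectExpr("key", "upper(CAST(value AS STRING)) AS value")

query = (alerts.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "alerts")
    .option("checkpointLocation", "/tmp/checkpoints/rtm-shape")
    .start())

query.awaitTermination()
```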
3. PySpark UDFs & Python Data Sources
Significant performance and observability improvements for Python workloads.
Enhancements include:
Arrow-native UDF & UDTF decorators (see the sketch below)
- Execute directly on PyArrow data
- Avoid pandas conversion overhead
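A sketch of the new decorator style, assuming an arrow_udf decorator as described in the 4.1 release notes (the exact name and signature may differ in your build):

```python
# Arrow-native UDF sketch. The arrow_udf decorator is assumed from the
# 4.1 release notes; verify the import and signature in your version.
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql.functions import arrow_udf  # assumed 4.1 API
from pyspark.sql.types import DoubleType

@arrow_udf(DoubleType())
def fahrenheit(celsius: pa.Array) -> pa.Array:
    # Operates on pyarrow Arrays directly, with no Arrow <-> pandas
    # round trip in the Python worker.
    return pc.add(pc.multiply(celsius, 1.8), 32.0)

df = spark.range(5).selectExpr("CAST(id AS DOUBLE) AS c")
df.select(fahrenheit("c").alias("f")).show()
```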
Python Data Source filter pushdown (see the sketch below)
- Reduces data movement
- Improves query efficiency
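A hedged sketch of a reader opting in; the pushFilters hook and the EqualTo filter class are assumptions based on the 4.1 Python Data Source documentation:

```python
# Filter pushdown sketch for a Python Data Source reader. The pushFilters
# hook and EqualTo class are assumed from the 4.1 docs; fetch_rows is a
# hypothetical helper standing in for the actual source access.
from pyspark.sql.datasource import DataSourceReader, EqualTo

class InventoryReader(DataSourceReader):
    def __init__(self, options):
        self.options = options
        self.pushed = []

    def pushFilters(self, filters):
        # Keep equality filters to apply at the source; yield the rest
        # back so Spark evaluates them after the scan.
        for f in filters:
            if isinstance(f, EqualTo):
                self.pushed.append(f)
            else:
                yield f

    def read(self, partition):
        # Apply self.pushed while producing rows, so filtered-out records
        # never cross the Python/JVM boundary.
        yield from fetch_rows(self.options, self.pushed)  # hypothetical
```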
Python worker logging
- Captures logs from UDF execution
- Exposed via a built-in table-valued function
Outcome:
Faster execution, lower memory usage, and better debugging for PySpark applications.
4. Spark Connect Improvements
Spark Connect continues to mature as the default client-server architecture.
What’s new:
Spark ML on Connect is GA for Python
- Smarter model caching
- Improved memory management
Better stability for large workloads:
- Zstandard (zstd) compressed protobuf plans
- Chunked Arrow result streaming
- Improved handling of large local relations
Benefit:
More reliable and scalable remote execution for notebooks, services, and multi-language clients.
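As a usage sketch, the standard pyspark.ml code below runs unchanged against a Spark Connect endpoint; with ML on Connect GA, fit() executes server-side and the client holds a lightweight handle to the cached model (the endpoint URL is a placeholder):

```python
# Training a Spark ML model over Spark Connect; only the .remote(...)
# builder call differs from classic local usage.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

train = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1)), (0.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"],
)

# fit() runs on the server; the returned model object is a client-side
# reference to the server-side cached model.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)
```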
5. SQL Enhancements
Spark SQL sees major usability and performance upgrades.
Highlights:
SQL Scripting (enabled by default)
- Cleaner variable declarations
- Improved error handling
VARIANT data type
- Efficient shredding
- Faster reads for semi-structured data (JSON-like)
Recursive CTE support
New approximate data sketches
- KLL sketches
- Theta sketches
Impact:
More expressive SQL, better support for semi-structured data, and advanced analytical capabilities.
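Two of these are easy to try from PySpark. The recursive CTE below generates a small sequence, and parse_json plus variant_get show the VARIANT round trip (an active SparkSession named spark is assumed, and the recursive CTE syntax follows the SQL standard; verify both against the 4.1 docs):

```python
# Recursive CTE (new in 4.1): generate the integers 1 through 5.
spark.sql("""
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5
    )
    SELECT n FROM counter
""").show()

# VARIANT: parse semi-structured JSON once, then extract a typed field.
spark.sql("""
    SELECT variant_get(parse_json('{"user": {"id": 42}}'), '$.user.id', 'int')
        AS user_id
""").show()
```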