What is a DataFrame?

Bappy10

Post by Bappy10 »

An RDD, or Resilient Distributed Dataset, is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that is resilient to failures and can be operated on in parallel. RDDs are typically used for low-level transformations and actions in Spark.
Example of RDD:
// Assumes an existing SparkContext named sc, as provided by spark-shell.
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)
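To make "transformations and actions" a bit more concrete, here is a minimal sketch. It assumes Spark 3.x with Scala 2.12 or 2.13 and a local master; the app name and the variable names are only illustrative. In spark-shell, getOrCreate() simply returns the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation, not a cluster setup.
val spark = SparkSession.builder()
  .appName("rdd-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))

// Transformations are lazy: nothing is computed on these lines yet.
val squaredEvens = rdd.filter(_ % 2 == 0).map(n => n * n)

// Actions trigger the computation and return results to the driver.
val result = squaredEvens.collect() // Array(4, 16)
val total = rdd.reduce(_ + _)       // 15

println(result.mkString(", "))
println(total)
```

filter and map only describe the work to be done; collect and reduce are the actions that actually run it.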

A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R or Python (pandas). DataFrames are built on top of RDDs but provide a higher-level, more user-friendly API for manipulating structured and semi-structured data, and their queries run through Spark SQL's optimized execution engine.
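Example of DataFrame:
// Assumes an existing SparkSession named spark, as provided by spark-shell.
val df = spark.read.json("examples/people.json")
// Print the first rows in tabular form.
df.show()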
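If you do not have a JSON file at hand, a DataFrame can also be built from a local collection. The sketch below assumes Spark 3.x with Scala 2.12 or 2.13 and a local master; the sample rows and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .appName("dataframe-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // enables toDF on local Seqs

// Hypothetical sample data with two named columns.
val people = Seq(("Alice", 29), ("Bob", 35), ("Carol", 41)).toDF("name", "age")

people.printSchema()

// Column-based operations, much like SQL: WHERE age > 30, SELECT name.
val over30 = people.filter(col("age") > 30).select("name")
over30.show()

// Pull the names back to the driver; sorted so the order is deterministic.
val names = over30.collect().map(_.getString(0)).sorted
println(names.mkString(", "))
```

Columns are referred to by name here, so a misspelled column is only reported at runtime, when Spark analyzes the query.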

What is a Dataset?
A Dataset is a distributed collection of data that combines the benefits of RDDs and DataFrames. Datasets are strongly typed, so they offer compile-time type safety in Scala and Java while still using the optimized execution engine of Spark SQL. In the Scala API, a DataFrame is simply a Dataset[Row].
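Example of Dataset, as a minimal sketch: it assumes Spark 3.x with Scala 2.12 or 2.13 and a local master, and the Person case class and sample rows are hypothetical. It is written as a script or spark-shell session; in a compiled application the case class has to be defined outside the method that uses it.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; its fields become the Dataset's columns.
case class Person(name: String, age: Int)

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .appName("dataset-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // provides encoders for case classes

val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

// Typed lambdas: the compiler checks that age and name exist on Person.
val namesOver30: Dataset[String] = people.filter(p => p.age > 30).map(p => p.name)
namesOver30.show()

// Converting between the untyped and typed views of the same data.
val asDataFrame = people.toDF()          // Dataset[Row]
val roundTrip = asDataFrame.as[Person]   // back to Dataset[Person]
println(roundTrip.count())
```

Referring to a field that Person does not have, such as p.salary, fails at compile time instead of at runtime, which is the type safety mentioned above.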