An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any datasource, e.g. text files, a database via JDBC, etc.
The formal definition is:
RDDs are fault-tolerant, parallel data structures that let users
explicitly persist intermediate results in memory, control their
partitioning to optimize data placement, and manipulate them using a
rich set of operators.
If you want the full details on what an RDD is, read one of the core Spark academic papers, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…