RDDs in Spark
Berkeley published a paper dedicated to explaining what an RDD is, and the Spark source code also contains a long comment describing it. An RDD is an abstraction over a dataset plus the operations attached to it. The following is excerpted from the comment in RDD.scala:
Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
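To make these properties concrete, below is a minimal sketch of a custom RDD (the RangeRDD and RangePartition names are hypothetical) that supplies the first two properties explicitly and keeps the defaults for the other three:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition describing a half-open range of integers.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// A toy RDD that produces the integers [0, n) across numSlices partitions.
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {  // Nil = property 3: no parent dependencies

  // Property 1: a list of partitions.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // Property 2: a function for computing each split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // Properties 4 (partitioner) and 5 (preferred locations) keep their
  // default values here: None and Nil.
}
```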
This article mainly lists the various RDDs and how each of them is created.
###AsyncRDDActions
###BlockRDD
###CartesianRDD
###CheckpointRDD
###CoalescedRDD
###CoGroupedRDD
An RDD that cogroups its parents. For each key k in the parent RDDs, the resulting RDD contains a tuple with the list of values for that key. (See the cogroup sketch under PairRDDFunctions below.)
###DoubleRDDFunctions
###EmptyRDD
###FilteredRDD
###FlatMappedRDD
###FlatMappedValuesRDD
###GlommedRDD
###HadoopRDD
Created when an application calls SparkContext's textFile, which in turn calls hadoopFile.
textFile: Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
hadoopFile: Get an RDD for a Hadoop file with an arbitrary InputFormat. The returned RDD cannot be cached directly; map over it first to copy each record, because Hadoop's RecordReader re-uses the same Writable object for every record, so caching would otherwise store many references to the same object.
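A short sketch of both entry points (the HDFS path is hypothetical; sc is an existing SparkContext), showing the map-before-cache pattern described above:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// textFile calls hadoopFile internally and maps each (key, value) record
// to a new String, so the result is safe to cache as-is.
val lines = sc.textFile("hdfs:///tmp/input.txt")
lines.cache()

// Calling hadoopFile directly yields re-used Writable objects;
// copy the records with a map before caching.
val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///tmp/input.txt")
val copied = raw.map { case (_, text) => text.toString }
copied.cache()
```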
###JdbcRDD
###MapPartitionsRDD
###MappedRDD
Created when the generic RDD.map is called.
map: Return a new RDD by applying a function to all elements of this RDD.
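A minimal sketch, assuming sc is an existing SparkContext:

```scala
val nums = sc.parallelize(1 to 5)
val squares = nums.map(x => x * x)  // map creates a MappedRDD
squares.collect()                   // Array(1, 4, 9, 16, 25)
```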
###MappedValuesRDD
Created by mapValues in PairRDDFunctions.
###NewHadoopRDD
###OrderedRDDFunctions
###PairRDDFunctions
Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
Import `org.apache.spark.SparkContext._` at the top of your program to use these functions.
All of the following aggregation functions are built on top of combineByKey:
combineByKey: Generic function to combine the elements for each key using a custom set of aggregation functions (createCombiner, mergeValue, mergeCombiners).
foldByKey: Merge the values for each key using an associative function and a neutral "zero value".
reduceByKey: Merge the values for each key using an associative reduce function.
groupByKey: Group the values for each key in the RDD into a single sequence.
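A sketch of the four functions above (sc is an existing SparkContext; the result comments are illustrative):

```scala
import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
pairs.foldByKey(0)(_ + _).collect()  // same result, with an explicit zero value
pairs.groupByKey().collect()         // a -> values 1 and 3, b -> value 2

// The generic form the three above are built on:
pairs.combineByKey[List[Int]](
  (v: Int) => List(v),                     // createCombiner
  (acc: List[Int], v: Int) => v :: acc,    // mergeValue
  (a: List[Int], b: List[Int]) => a ::: b  // mergeCombiners
).collect()
```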
join: Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
Internally, this function calls this.cogroup and then flatMapValues on the result.
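For example (illustrative values):

```scala
val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", "x"), ("a", "y")))
left.join(right).collect()  // Array((a,(1,x)), (a,(1,y))); "b" has no match and is dropped
```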
cogroup: For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the list of values for that key in `this` as well as `other`.
This function creates a CoGroupedRDD, and then calls cg.mapValues on it (cg being the CoGroupedRDD).
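For example (illustrative values):

```scala
val a = sc.parallelize(Seq(("k", 1), ("k", 2), ("m", 3)))
val b = sc.parallelize(Seq(("k", "x")))
a.cogroup(b).collect()
// k -> (values 1, 2 from a; value x from b), m -> (value 3 from a; nothing from b)
```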
subtractByKey: Return an RDD with the pairs from `this` whose keys are not in `other`.
mapValues: Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
This function produces a MappedValuesRDD.
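For example:

```scala
val prices = sc.parallelize(Seq(("apple", 1.0), ("pear", 2.0)))
prices.mapValues(_ * 1.1).collect()  // keys and partitioning unchanged
```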
###ParallelCollectionRDD
Created in SparkContext by calling parallelize or makeRDD.
parallelize: Distribute a local Scala collection to form an RDD
makeRDD: Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object. Create a new partition for each collection item.
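A short sketch of both (the hostnames are hypothetical):

```scala
// parallelize: split a local collection into 4 partitions.
val rdd1 = sc.parallelize(1 to 100, 4)

// makeRDD: one partition per item, each with its preferred hosts.
val rdd2 = sc.makeRDD(Seq(
  (1, Seq("host-a")),
  (2, Seq("host-b"))))
```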
SparkPi: estimates pi with the Monte Carlo method (see the sketch below). The map and reduce it uses are the generic functions defined on RDD; map produces a MappedRDD.
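A condensed sketch modeled on the SparkPi example that ships with Spark (sc is an existing SparkContext):

```scala
import scala.math.random

// Sample points in the unit square; the fraction falling inside the
// unit circle approximates pi / 4.
val n = 100000
val inside = sc.parallelize(1 to n).map { _ =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * inside / n}")
```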
###PartitionerAwareUnionRDD
###PartitionPruningRDD
###PartitionwiseSampledRDD
###PipedRDD
###RDDCheckpointData
###SampledRDD
###SequenceFileRDDFunctions
###ShuffledRDD
###SubtractedRDD
###UnionRDD
###ZippedPartitionsRDD
###ZippedRDD
###ZippedWithIndexRDD