[SPARK] Setting Up a Spark Build Environment: SparkPi as an Example

In the previous post, we covered how to install a standalone (single-machine) Spark.
Next, in order to write our own Spark programs, we need to set up a build environment.
This is where SBT comes in: SBT (Simple Build Tool) is the standard build tool for Scala projects.
If you are interested, see: https://www.scala-sbt.org/

Next, we will create a new project
and re-run the SparkPi example from the previous post. We mainly follow this guide:
https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/spark_app_example.html
Since we made a few modifications, the commands are recorded below:

First, create the project folders and copy SparkPi into the source directory:
mkdir -p ~/spark-submit/project
mkdir -p ~/spark-submit/src/main/scala
cp /usr/lib/spark/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala ~/spark-submit/src/main/scala/SparkPi.scala
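For context, SparkPi estimates pi by Monte Carlo sampling: it throws random points into the square [-1, 1] x [-1, 1] and counts how many land inside the unit circle. A simplified sketch of that core logic (condensed from the official example; it assumes spark is an already-created SparkSession) looks like this:

// Monte Carlo estimate of pi, condensed from the official SparkPi example
val slices = 2                          // number of partitions (parallel tasks)
val n = 100000 * slices                 // total number of random samples
val count = spark.sparkContext
  .parallelize(1 until n, slices)
  .map { _ =>
    val x = math.random * 2 - 1         // random point in the square [-1, 1] x [-1, 1]
    val y = math.random * 2 - 1
    if (x * x + y * y <= 1) 1 else 0    // 1 if the point falls inside the unit circle
  }
  .reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / (n - 1)}")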

Next, set up the SBT build definition:
vim ~/spark-submit/build.sbt
Paste in the following content (the blank lines must be kept, and remember to adjust the Scala version to match your installation):
name := "SparkPi"

version := "1.0"

scalaVersion := "2.11.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.2"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.2"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
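A note on the %% operator: it tells SBT to append the Scala binary version to the artifact name, so with scalaVersion 2.11.x the dependency above resolves to spark-core_2.11. For illustration only, these two lines are equivalent:

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.2"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.2"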

Set the matching SBT version number; here it is 0.13.2.
As of today (2018/2/19), the latest release should be 1.1.1, but we keep the older value for now:
cat <<EOT > ~/spark-submit/project/build.properties
sbt.version=0.13.2
EOT

The following three commands compile, run, and package the project; it must be compiled before it can be run.
The first build downloads many dependencies and can take about an hour:
cd ~/spark-submit
sbt compile    # compile the project
sbt run        # run the project
sbt package    # package the project into a jar

For more details on building with SBT, this article is a good reference:
https://alvinalexander.com/scala/sbt-how-to-compile-run-package-scala-project
It uses a plain Scala project as its example, which makes the relationship between SBT and Scala easier to understand.
The packaged jar ends up at ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar
Then we simply submit the jar to Spark with the following command:
spark-submit ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar
spark-submit accepts various optional parameters; see:
https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/spark_submit_example.html#example-running-a-spark-application-with-optional-parameters
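As an example of those optional parameters, a variation like the following should also work (a sketch only: it assumes local mode with two worker threads, and that the copied file kept its original package so the main class is org.apache.spark.examples.SparkPi; the trailing 10 is the number of slices passed to SparkPi):

spark-submit --master local[2] --class org.apache.spark.examples.SparkPi ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar 10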

If you see the error "ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration",
it is because the original declaration does not say where the job should run (no master URL is set).
Modify this part of the source code as follows and run it again:
val spark = SparkSession
   .builder
   .appName("Spark Pi")
   .config("spark.master", "local")
   .getOrCreate()
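Hardcoding spark.master like this is needed for sbt run, but it pins the application to local mode. When launching through spark-submit, an alternative (not part of the original guide, just a suggestion) is to leave the builder untouched and pass the master URL on the command line instead:

spark-submit --master local[*] ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar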
The result is shown below:
~/submit$ sbt compile
[info] Loading project definition from /home/ubuntu/submit/project
[info] Loading settings for project submit from build.sbt ...
[info] Set current project to SparkPi (in build file:/home/ubuntu/submit/)
[info] Executing in batch mode. For better performance use sbt's shell
[info] Compiling 1 Scala source to /home/ubuntu/submit/target/scala-2.11/classes ...
[info] Done compiling.
[success] Total time: 5 s, completed Oct 29, 2018 10:20:54 AM
ubuntu@testspark:~/submit$ sbt run
[info] Loading project definition from /home/ubuntu/submit/project
[info] Loading settings for project submit from build.sbt ...
[info] Set current project to SparkPi (in build file:/home/ubuntu/submit/)
[info] Packaging /home/ubuntu/submit/target/scala-2.11/sparkpi_2.11-1.0.jar ...
[info] Done packaging.
[info] Running org.apache.spark.examples.SparkPi
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/10/29 10:21:17 INFO SparkContext: Running Spark version 2.1.2
18/10/29 10:21:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/29 10:21:18 INFO SecurityManager: Changing view acls to: ubuntu
18/10/29 10:21:18 INFO SecurityManager: Changing modify acls to: ubuntu
18/10/29 10:21:18 INFO SecurityManager: Changing view acls groups to:
18/10/29 10:21:18 INFO SecurityManager: Changing modify acls groups to:
18/10/29 10:21:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups with view permissions: Set(); users  with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/10/29 10:21:18 INFO Utils: Successfully started service 'sparkDriver' on port 44564.
18/10/29 10:21:18 INFO SparkEnv: Registering MapOutputTracker
18/10/29 10:21:18 INFO SparkEnv: Registering BlockManagerMaster
18/10/29 10:21:18 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/10/29 10:21:18 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/10/29 10:21:18 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-bb299189-1de7-4ea8-a92a-a50e75b33c70
18/10/29 10:21:18 INFO MemoryStore: MemoryStore started with capacity 408.9 MB
18/10/29 10:21:18 INFO SparkEnv: Registering OutputCommitCoordinator
18/10/29 10:21:18 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/10/29 10:21:18 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.16.0.222:4040
18/10/29 10:21:19 INFO Executor: Starting executor ID driver on host localhost
18/10/29 10:21:19 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45212.
18/10/29 10:21:19 INFO NettyBlockTransferService: Server created on 172.16.0.222:45212
18/10/29 10:21:19 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/10/29 10:21:19 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 172.16.0.222, 45212, None)
18/10/29 10:21:19 INFO BlockManagerMasterEndpoint: Registering block manager 172.16.0.222:45212 with 408.9 MB RAM, BlockManagerId(driver, 172.16.0.222, 45212, None)
18/10/29 10:21:19 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.16.0.222, 45212, None)
18/10/29 10:21:19 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.16.0.222, 45212, None)
18/10/29 10:21:19 INFO SparkContext: Starting job: reduce at SparkPi.scala:39
18/10/29 10:21:19 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:39) with 2 output partitions
18/10/29 10:21:19 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:39)
18/10/29 10:21:19 INFO DAGScheduler: Parents of final stage: List()
18/10/29 10:21:19 INFO DAGScheduler: Missing parents: List()
18/10/29 10:21:19 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:35), which has no missing parents
18/10/29 10:21:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 408.9 MB)
18/10/29 10:21:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1167.0 B, free 408.9 MB)
18/10/29 10:21:20 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.0.222:45212 (size: 1167.0 B, free: 408.9 MB)
18/10/29 10:21:20 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
18/10/29 10:21:20 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:35)
18/10/29 10:21:20 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/10/29 10:21:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5953 bytes)
18/10/29 10:21:20 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/10/29 10:21:20 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1041 bytes result sent to driver
18/10/29 10:21:20 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5953 bytes)
18/10/29 10:21:20 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/10/29 10:21:20 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 123 ms on localhost (executor driver) (1/2)
18/10/29 10:21:20 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1041 bytes result sent to driver
18/10/29 10:21:20 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 37 ms on localhost (executor driver) (2/2)
18/10/29 10:21:20 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/10/29 10:21:20 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:39) finished in 0.170 s
18/10/29 10:21:20 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:39, took 0.410026 s
Pi is roughly 3.142555712778564
18/10/29 10:21:20 INFO SparkUI: Stopped Spark web UI at http://172.16.0.222:4040
18/10/29 10:21:20 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/10/29 10:21:20 INFO MemoryStore: MemoryStore cleared
18/10/29 10:21:20 INFO BlockManager: BlockManager stopped
18/10/29 10:21:20 INFO BlockManagerMaster: BlockManagerMaster stopped
18/10/29 10:21:20 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/10/29 10:21:20 INFO SparkContext: Successfully stopped SparkContext
[success] Total time: 5 s, completed Oct 29, 2018 10:21:20 AM
18/10/29 10:21:20 INFO ShutdownHookManager: Shutdown hook called
18/10/29 10:21:20 INFO ShutdownHookManager: Deleting directory /tmp/spark-ed77a1be-cb46-41c9-834d-09d07a18ebbf
The output is basically the same as in the previous example.
However, since we did not specify how many tasks (slices) to run (previously we specified 10),
the job falls back to the default and is split into 2 parallel tasks, as the sketch below shows.
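This default comes from how SparkPi reads its command-line argument; a sketch of the relevant line (based on the standard example, your copy may differ slightly):

// the first program argument sets the number of slices (partitions); the default is 2
val slices = if (args.length > 0) args(0).toInt else 2

So to get 10 parallel tasks again, pass the argument through SBT with sbt "run 10", or append 10 to the spark-submit command shown earlier.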