1) install scala https://www.scala-lang.org/download/
2) install Jupyter `pip3 install jupyter`
3) download spark http://spark.apache.org/downloads.html
4) unzip spark to a folder like C:\spark
5) download winutils.exe from https://github.com/steveloughran/winutils (the file sits under the bin folder of your Hadoop version, i.e. version_of_hadoop/bin)
6) Add SPARK_HOME=C:\spark to your environment variables, and add %SPARK_HOME%\bin to your PATH (a quick sanity check follows below)
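To confirm the variables are actually visible to Python, here is a minimal check; `C:\spark` is the assumed path from step 6, so adjust it if you unzipped elsewhere:

```python
import os

# SPARK_HOME should point at the folder from step 6 (assumed C:\spark).
spark_home = os.environ.get("SPARK_HOME")
print("SPARK_HOME =", spark_home)

# The bin subfolder should appear on PATH after step 6.
path = os.environ.get("PATH", "")
print("bin on PATH:", spark_home is not None and (spark_home + r"\bin").lower() in path.lower())
```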
The following steps are optional
7) download hadoop http://hadoop.apache.org/releases.html
8-10) repeat the process from steps 4-6 for Hadoop: unzip it to a folder like C:\hadoop, put winutils.exe from step 5 into C:\hadoop\bin, set HADOOP_HOME=C:\hadoop, and add %HADOOP_HOME%\bin to your PATH (see the check below)
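A similar sanity check for the optional Hadoop setup; `C:\hadoop` is an assumption here, so use whatever folder you chose:

```python
import os

# HADOOP_HOME should point at the folder from steps 8-10 (assumed C:\hadoop).
hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)

# winutils.exe from step 5 should live in its bin subfolder.
if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe present:", os.path.exists(winutils))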
11) install pyspark `pip3 install pyspark`
12) install findspark `pip3 install findspark` (a quick import check follows below)
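Before running the full test in step 13, a quick smoke test that the two packages can find your Spark install. `findspark.init()` reads SPARK_HOME from the environment; it also accepts an explicit path, e.g. `findspark.init("C:\\spark")`:

```python
import findspark
findspark.init()  # must run before importing pyspark if Spark isn't already on sys.path

import pyspark
print(pyspark.__version__)  # prints the installed PySpark version
```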
13) Test:

```python
import findspark
findspark.init()  # run before importing pyspark so it can locate SPARK_HOME

import pyspark
import random
import datetime

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Sample a random point in the unit square;
    # True if it falls inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# with Spark
before = datetime.datetime.now()
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
after = datetime.datetime.now()
print(pi)
print(after - before)
sc.stop()

# without Spark
before = datetime.datetime.now()
count = 0
for i in range(0, num_samples):
    if inside(i):
        count += 1
pi = 4 * count / num_samples
after = datetime.datetime.now()
print(pi)
print(after - before)
```
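Both runs should print an estimate close to π ≈ 3.1416 (the Monte Carlo error shrinks roughly as 1/√num_samples), followed by the elapsed time; comparing the two durations shows how much Spark's parallel execution helps on your machine.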