Sunday, October 29, 2017

Install Spark on Windows 10

0) Make sure you have Python 3.6 and Java 8 installed
1) install scala https://www.scala-lang.org/download/
2) install Jupyter `pip3 install jupyter`
3) download spark http://spark.apache.org/downloads.html
4) unzip spark to a folder like C:\spark
5) download winutils.exe from https://github.com/steveloughran/winutils (pick the file under your_hadoop_version/bin) and place it in C:\spark\bin
6) add SPARK_HOME=C:\spark to your environment variables, and add %SPARK_HOME%\bin to your PATH
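Once these steps are done, a quick sanity check can confirm the environment is wired up. This is just a sketch; `check_spark_env` is a hypothetical helper, not part of Spark:

```python
import os

def check_spark_env(env=os.environ):
    """Return a list of setup problems (an empty list means the basics look OK)."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    # winutils.exe should have been copied into %SPARK_HOME%\bin in step 5
    if not os.path.exists(os.path.join(spark_home, "bin", "winutils.exe")):
        problems.append("winutils.exe not found in SPARK_HOME\\bin")
    return problems

print(check_spark_env())
```

Run it in the same shell where you set the variables; a fresh terminal is needed after editing environment variables on Windows.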

The following steps are optional:
7) download hadoop http://hadoop.apache.org/releases.html
8-10) repeat the same process as steps 4-6 for Hadoop (unzip to a folder like C:\hadoop, set HADOOP_HOME, and add %HADOOP_HOME%\bin to your PATH)
11) install pyspark `pip3 install pyspark`
12) install findspark `pip3 install findspark`
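For context, findspark works by putting Spark's bundled Python sources on `sys.path` so that `import pyspark` succeeds outside a Spark shell. A simplified sketch of the idea (this is not the library's actual code; the real `init()` also locates the bundled py4j zip):

```python
import os
import sys

def init_spark(spark_home):
    """Simplified sketch: make Spark's Python sources importable."""
    python_dir = os.path.join(spark_home, "python")
    # Spark ships its Python bindings under SPARK_HOME/python
    if python_dir not in sys.path:
        sys.path.insert(0, python_dir)
    os.environ["SPARK_HOME"] = spark_home
    return python_dir
```

This is why `findspark.init()` must run before `import pyspark` in the test below.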
13) Test:
import datetime
import random

import findspark
findspark.init()  # call before importing pyspark so SPARK_HOME is picked up

import pyspark

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(_):
    # Random point in the unit square; True if it lands inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# With Spark
before = datetime.datetime.now()
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
after = datetime.datetime.now()
print(pi)
print(after - before)
sc.stop()

# Without Spark, for comparison
before = datetime.datetime.now()
count = 0
for i in range(0, num_samples):
    if inside(i):
        count += 1
pi = 4 * count / num_samples
after = datetime.datetime.now()
print(pi)
print(after - before)
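The Monte Carlo idea behind the test can also be checked on its own at a smaller sample size. A standalone sketch (`estimate_pi` is a hypothetical helper; a fixed seed makes the run reproducible):

```python
import random

def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate: 4 * (points inside the quarter circle) / total points."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            hits += 1
    return 4 * hits / num_samples

print(estimate_pi(100_000))
```

The estimate converges slowly (error shrinks roughly as 1/sqrt(n)), which is why the Spark test above uses 100 million samples.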

