python – 配置Spark以使用Jupyter Notebook和Anaconda

我花了几天时间试图让Spark与我的Jupyter笔记本和Anaconda一起工作.这是我的.bash_profile的样子:

PATH="/my/path/to/anaconda3/bin:$PATH"

export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"

export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"

当我输入/my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell时,我可以在我的命令行shell中启动Spark.输出sc不是空的.它似乎工作正常.

当我键入pyspark时,它会启动我的Jupyter笔记本电脑.当我创建一个新的Python3笔记本时,会出现此错误:

[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:

我的Jupyter笔记本中的sc是空的.

有谁可以帮助解决这种情况?

只是想澄清一下:错误结束后冒号后面没有任何内容.我还尝试使用这个post创建我自己的启动文件,我在这里引用,所以你不必去看那里:

I created a short initialization script init_spark.py as follows:

06002

and placed it in the ~/.ipython/profile_default/startup/ directory

当我这样做时,错误变为:

[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:
Conda可以帮助正确管理很多依赖…

安装火花.假设在/ opt / spark中安装了spark,请将其包含在〜/ .bashrc中:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH

除了spark之外,创建一个包含所有必需依赖项的conda环境:

conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0

激活环境

$source activate findspark-jupyter-openjdk8-py3

启动Jupyter Notebook服务器:

$jupyter notebook

在浏览器中,创建一个新的Python3笔记本

尝试使用以下脚本计算PI(从this借来)

import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):     
  x, y = random.random(), random.random()
  return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
相关文章
相关标签/搜索