Notes on Startup Environment Differences Between PySpark 2.x and PySpark 3.x
- If the Spark version in use is 2.x, we can explicitly create a `SparkContext` and pass it along to the `SparkSession`:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession


def get_spark_context():
    conf = SparkConf().setAppName("myApp")
    # Disable Arrow-based transfers between the JVM and Python (Spark 2.x key)
    conf.set("spark.sql.execution.arrow.enabled", "false")
    # Explicitly create the SparkContext first ...
    sc = SparkContext(conf=conf)
    # ... then getOrCreate() reuses it for the SparkSession
    spark = SparkSession.builder.config(conf=conf) \
        .enableHiveSupport() \
        .getOrCreate()
    return sc, spark
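A quick caller sketch (hypothetical usage, not from the original post): since `getOrCreate()` picks up the context created inside the helper, the returned `sc` and `spark.sparkContext` refer to the same object.

```python
# Hypothetical caller of the Spark 2.x helper above
sc, spark = get_spark_context()

rdd = sc.parallelize(range(10))                          # RDD API via the explicit context
df = spark.createDataFrame([(1, "a")], ["id", "value"])  # DataFrame API via the session

# getOrCreate() reused the context built in get_spark_context()
assert spark.sparkContext is sc
```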
- But if the Spark version is 3.x, we should avoid creating the `SparkContext` explicitly, because the `SparkSession` manages the `SparkContext` automatically:
from pyspark.sql import SparkSession


def get_spark_session():
    # No explicit SparkContext: the builder creates and manages it.
    # Note: in Spark 3.x the Arrow switch has been renamed to
    # "spark.sql.execution.arrow.pyspark.enabled"; the old key below
    # still works but is deprecated.
    spark = SparkSession.builder \
        .appName("myApp") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .enableHiveSupport() \
        .getOrCreate()
    return spark
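Even without creating it yourself, the managed `SparkContext` is still reachable through the session whenever RDD-level APIs are needed. A minimal, hypothetical usage sketch:

```python
# Hypothetical caller of the Spark 3.x helper above
spark = get_spark_session()

sc = spark.sparkContext      # the context the session created and manages
print(sc.version)            # e.g. 3.1.x when run against a Spark 3.1 distribution

spark.sql("SELECT 1 AS ok").show()
```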
- When submitting the PySpark script to the Spark cluster, the only thing that changes is which Spark version's `spark-submit` you invoke. Taking Spark 3.1 as an example (a sketch of `pyspark_script.py` itself follows the command):
/usr/lib/software/spark/spark-3.1/bin/wq-spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
--executor-cores 4 \
--executor-memory 2g \
--name your_name \
--queue your_executor_queue \
--conf spark.type=SPARKCORE \
--conf spark.yarn.report.interval=10000 \
--conf spark.dynamicAllocation.maxExecutors=200 \
--driver-memory 20g \
--conf spark.executor.memoryOverhead=8192 \
--conf spark.driver.maxResultSize=20g \
--conf spark.default.parallelism=6000 \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python_venv/py38/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./python_venv/py38/bin/python \
--archives=path/to/your/python_envs/py38.tar.gz#python_venv \
pyspark_script.py
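For completeness, a minimal sketch of what `pyspark_script.py` might look like under this submission setup; the table name is a placeholder and the session setup mirrors the 3.x helper above:

```python
# pyspark_script.py -- minimal sketch; "your_db.your_table" is a placeholder
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder \
        .appName("myApp") \
        .enableHiveSupport() \
        .getOrCreate()

    # Intended to run on the Python interpreter shipped in py38.tar.gz,
    # selected via the PYSPARK_PYTHON settings in the submit command above.
    df = spark.sql("SELECT * FROM your_db.your_table LIMIT 10")
    df.show()

    spark.stop()


if __name__ == "__main__":
    main()
```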