A short note on startup environment differences between pyspark 2.x and pyspark 3.x

  • If the Spark version in use is 2.x, we can explicitly create a SparkContext and pass the same configuration to the SparkSession:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

def get_spark_context():
    # Build the configuration and create the SparkContext explicitly first
    conf = SparkConf().setAppName("myApp")
    conf.set("spark.sql.execution.arrow.enabled", "false")
    sc = SparkContext(conf=conf)
    # getOrCreate() reuses the SparkContext created above
    spark = SparkSession.builder.config(conf=conf) \
        .enableHiveSupport() \
        .getOrCreate()
    return sc, spark
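
For reference, a minimal usage sketch of the helper above (the sample data and column names are only illustrative, not from the original post):

# Hypothetical usage of the Spark 2.x helper above
sc, spark = get_spark_context()
rdd = sc.parallelize([(1, "a"), (2, "b")])
df = spark.createDataFrame(rdd, ["id", "label"])
df.show()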
  • If the Spark version in use is 3.x, however, we should avoid creating the SparkContext explicitly, because the SparkSession manages the SparkContext automatically:
from pyspark.sql import SparkSession

def get_spark_session():
    # The builder creates and manages the SparkContext internally;
    # it remains reachable as spark.sparkContext when the RDD API is needed.
    # Note: in Spark 3.x, spark.sql.execution.arrow.enabled is deprecated in
    # favor of spark.sql.execution.arrow.pyspark.enabled (the old key still works).
    spark = SparkSession.builder \
        .appName("myApp") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .enableHiveSupport() \
        .getOrCreate()
    return spark
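
If an explicit SparkContext handle is still needed under Spark 3.x (for the RDD API, say), a minimal sketch using the helper above is:

# The SparkContext is owned by the SparkSession; we only borrow a reference
spark = get_spark_session()
sc = spark.sparkContext
rdd = sc.parallelize(range(10))
print(rdd.sum())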
  • When submitting a pyspark script to the Spark cluster, the only thing that changes is which version's spark-submit you invoke. Taking Spark 3.1 as an example (a minimal script sketch follows the command):
/usr/lib/software/spark/spark-3.1/bin/wq-spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
--executor-cores 4 \
--executor-memory 2g \
--name your_name \
--queue your_executor_queue \
--conf spark.type=SPARKCORE \
--conf spark.yarn.report.interval=10000 \
--conf spark.dynamicAllocation.maxExecutors=200 \
--driver-memory 20g \
--conf spark.executor.memoryOverhead=8192 \
--conf spark.driver.maxResultSize=20g \
--conf spark.default.parallelism=6000 \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python_venv/py38/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./python_venv/py38/bin/python \
--archives=path/to/your/python_envs/py38.tar.gz#python_venv \
pyspark_script.py
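
For completeness, pyspark_script.py might look like the minimal sketch below (the probe query is a placeholder, not part of the original post); in cluster mode both the driver and the Python workers pick up ./python_venv/py38/bin/python from the unpacked archive:

from pyspark.sql import SparkSession

def main():
    # Same Spark 3.x pattern as above: let the builder manage the SparkContext
    spark = SparkSession.builder \
        .appName("myApp") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sql("SELECT 1 AS probe").show()  # placeholder; replace with real job logic
    spark.stop()

if __name__ == "__main__":
    main()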
