Using Apache Spark

Apache Spark is a distributed data processing engine for big data workloads.

The Thrift JDBC/ODBC server corresponds to HiveServer2 in built-in Hive. You can test the JDBC server with the beeline script that comes with either Spark or Hive. To connect to the Spark Thrift Server from any machine in a Big Data Service cluster, use the spark-beeline command.
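
For example, a connection from a cluster node might look like the following sketch. The host name, port, and user name are placeholders; the Thrift Server typically listens on port 10000, but confirm the endpoint in your cluster's configuration.

    # Connect to the Spark Thrift Server with the beeline wrapper.
    # <thrift-server-host> and <username> are placeholders for your cluster.
    spark-beeline -u "jdbc:hive2://<thrift-server-host>:10000/default" -n <username>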

Spark Configuration Properties

The following Spark configuration properties are available in Big Data Service 3.1.1 or later, grouped by the configuration file they belong to.

spark3-env

    spark_history_secure_opts: Spark History Server Java options if security is enabled
    spark_history_log_opts: Spark History Server logging Java options
    spark_thrift_log_opts: Spark Thrift Server logging Java options
    spark_library_path: Paths containing shared libraries for Spark
    spark_dist_classpath: Paths containing Hadoop libraries for Spark
    spark_thrift_remotejmx_opts: Spark Thrift Server Java options if remote JMX is enabled
    spark_history_remotejmx_opts: Spark History Server Java options if remote JMX is enabled

spark3-defaults

    spark_history_store_path: Location of the Spark History Server cache. To access this property, go to the Ambari home page, select Spark3, select Configs, and then select Advanced spark3-defaults. The default value is /u01/lib/spark3/shs_db. You can edit this value to change the cache location as needed.

livy2-env

    livy_server_opts: Livy Server Java options
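
As an illustration, the history server cache setting is an ordinary spark3-defaults entry. A minimal sketch of what an override might look like, assuming the Ambari property spark_history_store_path maps to Spark's spark.history.store.path option (verify that mapping in your cluster before relying on it):

    # Hypothetical spark3-defaults entry; the mapping to Spark's
    # spark.history.store.path setting is an assumption, not confirmed here.
    spark.history.store.path /u01/lib/spark3/shs_db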

Group Permission to Download Policies

You can grant a user group permission to download Ranger policies, which allows all users in that group to run SQL queries through Spark jobs.

In a Big Data Service HA cluster with the Ranger-Spark plugin enabled, you must have permission to download Ranger policies before you can run SQL queries using Spark jobs. To grant this permission, the user must be included in the policy.download.auth.users and tag.download.auth.users lists. For more information, see Spark Job Might Fail With a 401 Error While Trying to Download the Ranger-Spark Policies.

Instead of listing many users individually, you can configure the policy.download.auth.groups parameter with a user group in the Spark-Ranger repository in the Ranger UI. This allows all users in that group to download Ranger policies. This feature is supported in ODH version 2.0.10 or later.

Example:

  1. Access the Ranger UI.
  2. Select Edit on the Spark repository.
  3. Navigate to the Add New Configurations section.
  4. Add or update policy.download.auth.groups with the user group.

    Example:

    policy.download.auth.groups = spark,testgroup

  5. Select Save.
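
If you prefer to script this change, the same configuration can be updated through Ranger's public REST API. The following is a minimal sketch, assuming a Ranger Admin endpoint at <ranger-host>:6080 and a Spark repository named spark; the host, credentials, and service name are placeholders for your cluster's values.

    # Fetch the current Spark service definition, including its configs.
    curl -u admin:<password> \
      "http://<ranger-host>:6080/service/public/v2/api/service/name/spark"

    # Edit the JSON so that "configs" includes
    # "policy.download.auth.groups": "spark,testgroup", then push it
    # back using the service id returned by the call above.
    curl -u admin:<password> -X PUT \
      -H "Content-Type: application/json" \
      -d @spark-service.json \
      "http://<ranger-host>:6080/service/public/v2/api/service/<service-id>"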

Setting User-Level Permissions for Spark in Ranger

To manage which users can access Spark resources, set user-level permissions in the Ranger UI.

  1. Access the Ranger Admin UI.
  2. From the list of repositories, select the Spark service.
  3. Select Add New Policy or select an existing policy to edit.
  4. Select the resource (database, sparkservice, or other) you want to set permissions for.
  5. In the Allow Conditions section, under Select User, select a user's name from the list. Then, under Permissions, select the permissions you want to grant that user.
  6. Select Save.
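
The same policies can also be created programmatically through Ranger's public REST API. The following is a minimal sketch, assuming a Spark repository named spark that uses Hive-style resources (database, table, column); the host, credentials, user, and database names are hypothetical.

    # Hypothetical policy granting SELECT on all tables in database
    # "sales_db" to user "analyst1"; adjust the names for your cluster.
    curl -u admin:<password> -X POST \
      -H "Content-Type: application/json" \
      -d '{
            "service": "spark",
            "name": "sales_db_select_analyst1",
            "resources": {
              "database": {"values": ["sales_db"]},
              "table": {"values": ["*"]},
              "column": {"values": ["*"]}
            },
            "policyItems": [{
              "users": ["analyst1"],
              "accesses": [{"type": "select", "isAllowed": true}]
            }]
          }' \
      "http://<ranger-host>:6080/service/public/v2/api/policy"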

Spark-Ranger Plugin Extension

The Spark-Ranger plugin extension can't be overridden at runtime in ODH version 2.0.10 or later.

Note

Fine-grained access control can't be fully enforced in non-Spark Thrift Server use cases through the Spark Ranger plugin. The Ranger admin is expected to grant the required file access permissions to data in HDFS through HDFS Ranger policies.
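
For example, an HDFS Ranger policy granting a user read access to a warehouse path might look like the following sketch, created through Ranger's public REST API. The service name "hdfs", the path, and the user are placeholders for your cluster's values.

    # Hypothetical HDFS policy granting read and execute access on a
    # warehouse path to user "analyst1"; all names are placeholders.
    curl -u admin:<password> -X POST \
      -H "Content-Type: application/json" \
      -d '{
            "service": "hdfs",
            "name": "warehouse_read_analyst1",
            "resources": {
              "path": {"values": ["/warehouse/tablespace/*"], "isRecursive": true}
            },
            "policyItems": [{
              "users": ["analyst1"],
              "accesses": [
                {"type": "read", "isAllowed": true},
                {"type": "execute", "isAllowed": true}
              ]
            }]
          }' \
      "http://<ranger-host>:6080/service/public/v2/api/policy"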