Using Apache Spark

Apache Spark is a distributed data processing engine for big data workloads.

The Thrift JDBC/ODBC server corresponds to HiveServer2 in built-in Hive. You can test the JDBC server with the beeline script that comes with either Spark or Hive. To connect to the Spark Thrift Server from any machine in a Big Data Service cluster, use the spark-beeline command.
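
For example, a connection from a cluster node might look like the following sketch. The host name, port, and user name are placeholders; the Thrift Server typically listens on port 10000, but confirm the endpoint in your cluster's configuration.

    # Connect to the Spark Thrift Server with the beeline wrapper.
    # <thrift-server-host> and <username> are placeholders for your cluster.
    spark-beeline -u "jdbc:hive2://<thrift-server-host>:10000/default" -n <username>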

Spark Configuration Properties

The following Spark configuration properties are available in Big Data Service 3.1.1 or later, grouped by the configuration file they belong to.

spark3-env

    spark_history_secure_opts: Spark History Server Java options if security is enabled
    spark_history_log_opts: Spark History Server logging Java options
    spark_thrift_log_opts: Spark Thrift Server logging Java options
    spark_library_path: Paths containing shared libraries for Spark
    spark_dist_classpath: Paths containing Hadoop libraries for Spark
    spark_thrift_remotejmx_opts: Spark Thrift Server Java options if remote JMX is enabled
    spark_history_remotejmx_opts: Spark History Server Java options if remote JMX is enabled

spark3-defaults

    spark_history_store_path: Location of the Spark History Server cache. To access this property, go to the Ambari home page, select Spark3, select Configs, and then select Advanced spark3-defaults. The default value is /u01/lib/spark3/shs_db. You can edit this value to change the cache location as needed.

livy2-env

    livy_server_opts: Livy Server Java options
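
As an illustration, the history server cache setting is an ordinary spark3-defaults entry. A minimal sketch of what an override might look like, assuming the Ambari property spark_history_store_path maps to Spark's spark.history.store.path option (verify that mapping in your cluster before relying on it):

    # Hypothetical spark3-defaults entry; the mapping to Spark's
    # spark.history.store.path setting is an assumption, not confirmed here.
    spark.history.store.path /u01/lib/spark3/shs_db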

Group Permission to Download Policies

You can grant a user group permission to download Ranger policies, which allows all users in that group to run SQL queries through Spark jobs.

In a Big Data Service HA cluster with the Ranger-Spark plugin enabled, you must have permission to download Ranger policies before you can run SQL queries using Spark jobs. To grant this permission, the user must be included in the policy.download.auth.users and tag.download.auth.users lists. For more information, see Spark Job Might Fail With a 401 Error While Trying to Download the Ranger-Spark Policies.

Instead of listing many users individually, you can configure the policy.download.auth.groups parameter with a user group in the Spark-Ranger repository in the Ranger UI. This allows all users in that group to download Ranger policies. This feature is supported in ODH version 2.0.10 or later.

Example:

  1. Access the Ranger UI.
  2. Select Edit on the Spark repository.
  3. Navigate to the Add New Configurations section.
  4. Add or update policy.download.auth.groups with the user group.

    Example:

    policy.download.auth.groups = spark,testgroup

  5. Select Save.
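
If you prefer to script this change, the same configuration can be updated through Ranger's public REST API. The following is a minimal sketch, assuming a Ranger Admin endpoint at <ranger-host>:6080 and a Spark repository named spark; the host, credentials, and service name are placeholders for your cluster's values.

    # Fetch the current Spark service definition, including its configs.
    curl -u admin:<password> \
      "http://<ranger-host>:6080/service/public/v2/api/service/name/spark"

    # Edit the JSON so that "configs" includes
    # "policy.download.auth.groups": "spark,testgroup", then push it
    # back using the service id returned by the call above.
    curl -u admin:<password> -X PUT \
      -H "Content-Type: application/json" \
      -d @spark-service.json \
      "http://<ranger-host>:6080/service/public/v2/api/service/<service-id>"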

Setting User-Level Permissions for Spark in Ranger

To manage which users can access Spark resources, set user-level permissions in the Ranger UI.

  1. Access the Ranger Admin UI.
  2. From the list of repositories, select the Spark service.
  3. Select Add New Policy or select an existing policy to edit.
  4. Select the resource (database, sparkservice, or other) you want to set permissions for.
  5. In the Allow Conditions section, under Select User, select a user's name from the list. Then, under Permissions, select the permissions you want to grant that user.
  6. Select Save.
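
The same policies can also be created programmatically through Ranger's public REST API. The following is a minimal sketch, assuming a Spark repository named spark that uses Hive-style resources (database, table, column); the host, credentials, user, and database names are hypothetical.

    # Hypothetical policy granting SELECT on all tables in database
    # "sales_db" to user "analyst1"; adjust the names for your cluster.
    curl -u admin:<password> -X POST \
      -H "Content-Type: application/json" \
      -d '{
            "service": "spark",
            "name": "sales_db_select_analyst1",
            "resources": {
              "database": {"values": ["sales_db"]},
              "table": {"values": ["*"]},
              "column": {"values": ["*"]}
            },
            "policyItems": [{
              "users": ["analyst1"],
              "accesses": [{"type": "select", "isAllowed": true}]
            }]
          }' \
      "http://<ranger-host>:6080/service/public/v2/api/policy"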

Spark-Ranger Plugin Extension

The Spark-Ranger plugin extension can't be overridden at runtime in ODH version 2.0.10 or later.

Note

Fine-grained access control can't be fully enforced in non-Spark Thrift Server use cases through the Spark Ranger plugin. The Ranger admin is expected to grant the required file access permissions to data in HDFS through HDFS Ranger policies.
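
For example, an HDFS Ranger policy granting a user read access to a warehouse path might look like the following sketch, created through Ranger's public REST API. The service name "hdfs", the path, and the user are placeholders for your cluster's values.

    # Hypothetical HDFS policy granting read and execute access on a
    # warehouse path to user "analyst1"; all names are placeholders.
    curl -u admin:<password> -X POST \
      -H "Content-Type: application/json" \
      -d '{
            "service": "hdfs",
            "name": "warehouse_read_analyst1",
            "resources": {
              "path": {"values": ["/warehouse/tablespace/*"], "isRecursive": true}
            },
            "policyItems": [{
              "users": ["analyst1"],
              "accesses": [
                {"type": "read", "isAllowed": true},
                {"type": "execute", "isAllowed": true}
              ]
            }]
          }' \
      "http://<ranger-host>:6080/service/public/v2/api/policy"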