Conversation

@ashokkumarrathore ashokkumarrathore commented Jan 1, 2025

What changes were proposed in this pull request?

Related issue: #461

How was this patch tested?

Tested by deploying the updated server in a k8s cluster with namespaces and running a job.

@ashokkumarrathore changed the title from "Livy nmsp fix" to "Spark on k8s: changes for using namespaces with Kubernetes client APIs." Jan 1, 2025
private var sessionLeakageCheckInterval: Long = _

var kubernetesClient: DefaultKubernetesClient = _
var appNamespaces: mutable.Set[String] = mutable.Set("default")
Contributor

Why do we need a default namespace at all? Can it be empty initially? Also, is it thread-safe (or can we make it so)?

Contributor

Did you consider making the list of allowed namespaces configurable? We know it upfront anyway when configuring permissions for the Livy service account. It is also worth checking how recovery works; I haven't looked at it for a while, but having Livy monitor all the namespaces from the get-go may save us from leaked/unrecovered apps.
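For what it's worth, a minimal sketch of what a configurable allowlist could look like; the config key livy.server.kubernetes.allowed-namespaces and the helper below are assumptions for illustration, not part of this PR or of Livy's existing configuration:

```scala
import java.util.Properties

// Hypothetical sketch only: the key name below is illustrative, not an
// existing Livy configuration option.
def allowedNamespaces(conf: Properties): Set[String] =
  Option(conf.getProperty("livy.server.kubernetes.allowed-namespaces"))
    .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSet)
    .getOrElse(Set.empty[String])
```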

Contributor

Then we also would not need to propagate the app namespace everywhere.
WDYT?

Author

Thank you for the inputs, @jahstreet.

  1. If the user does not specify a namespace, the job goes to the default namespace; that's why we start with it. But we can drop the default too, since the set gets populated after the first job submission anyway. For thread safety, I will switch it to java.util.concurrent.ConcurrentHashMap (see the sketch after this list).
  2. A configurable namespace allowlist would work, but it would require a redeployment every time a new namespace is allowed. The RoleBindings can be created in new namespaces (to allow job submission from the Livy service account) independently.
  3. The reason we maintain this list of namespaces is leaked apps. The idea is that going through all the namespaces where jobs were submitted should be enough.
  4. Having Livy monitor all namespaces is not feasible for security reasons: in a multi-tenant cluster, the Livy service account might not have any permissions on some namespaces.
  5. For recovery, we now also save the namespace in server/src/main/scala/org/apache/livy/server/batch/BatchSession.scala, which will be used to recover the session.
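On point 1, a minimal sketch of what the thread-safe variant could look like, assuming Scala 2.11/2.12 JavaConverters; only the appNamespaces name comes from the diff above, the rest is illustrative:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._
import scala.collection.mutable

// Back the set with a ConcurrentHashMap-based key set so that concurrent
// session threads can register namespaces safely.
val appNamespaces: mutable.Set[String] =
  ConcurrentHashMap.newKeySet[String]().asScala

// Registering the namespace of a newly submitted job is then a plain,
// thread-safe update (helper name is illustrative):
def registerNamespace(ns: String): Unit = appNamespaces += ns
```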

Contributor

Thanks, it covers most of my thoughts 👍.

In case namespaces are dynamic, should we also truncate the set of watched namespaces to handle edge cases such as revoked access rights or namespace deletion?

Author

That's a good point. The challenge is deciding when to truncate. One option I can think of is to do it just after deleting all the leaked apps; in a rare case this might still leave some leaked apps (apps that were just added to the leaked list). I will see what other options we have.
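To make the idea concrete, a rough sketch under the assumption of two hypothetical helpers (deleteLeakedApps and namespacesWithLiveSessions do not exist in this PR):

```scala
import scala.collection.mutable

// Hypothetical sketch: prune the watched-namespace set right after the
// leaked-app cleanup pass. Both function parameters are illustrative
// placeholders, not APIs from this PR.
def truncateAfterCleanup(
    appNamespaces: mutable.Set[String],
    deleteLeakedApps: () => Unit,
    namespacesWithLiveSessions: () => Set[String]): Unit = {
  deleteLeakedApps()
  val stillNeeded = namespacesWithLiveSessions()
  // Keep only namespaces that still host Livy-managed apps; anything that
  // leaks between the cleanup and this point is caught on the next pass.
  appNamespaces.retain(stillNeeded.contains)
}
```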

I still need to write some unit tests, but before that I want to make sure the existing tests pass. How do I get this workflow approved by a maintainer?

Contributor

@gyogal can you help here, please?

Contributor

@gyogal gyogal Jan 27, 2025

Thank you @ashokkumarrathore for the PR and many thanks also to @askhatri, @jahstreet for the feedback! The unit test executor is currently broken and I am in the process of fixing it. As soon as it is working again, I will start the check jobs on this PR.


Hi @gyogal, do you have any updates on this PR?

Contributor

The unit test jobs are now fixed; however, the PR needs to be rebased onto the head of the branch so that they can be started. There are also some new conflicts. Would it be possible to rebase this PR?
