Skip to content

Conversation

@sabrenner
Copy link
Collaborator

What does this PR do?

Enables LLM Observability with agentless_enabled=True to ensure seamless compatibility with the agent used in the DD Extension layer. Previously, LLM Observability would try and use an agent proxy endpoint which doesn't exist on the trace agent used in the next version of the DD Extension layer.

Since all we really use the proxy for outside of serverless is so that users don't have to re-state their API key, it should be fine to just use agentless for serverless environments by default (as LLM Observability now also still sends APM traces even if agentless_enabled=True is set).

To help enable this, I added a call to init_api to get the DD_API_KEY from the secrets manager if it lives there, to make the experience even smoother. Otherwise, we can enforce DD_API_KEY in the lambda function's env vars

Motivation

MLOB-2225

Testing Guidelines

Built the layer with the build_layers script, and verified against our LLM Observability lambda functions with only

  • DD_LLMOBS_ENABLED
  • DD_LLMOBS_ML_APP
  • DD_API_KEY

set, and verified traces showed up in the UI.

Additional Notes

Happy to remove the init_api part if it would be too much of a burden on the code path for a serverless env. I saw it was available, and only used for metrics, but decided to re-use. It can be revisited later if needed and we can enforce DD_API_KEY being set directly instead.

Types of Changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog
  • This PR passes the integration tests (ask a Datadog member to run the tests)

@sabrenner sabrenner marked this pull request as ready for review March 10, 2025 17:54
@sabrenner sabrenner requested a review from a team as a code owner March 10, 2025 17:54
llmobs_api_key = None
llmobs_env_var = os.environ.get("DD_LLMOBS_ENABLED", "false").lower() in ("true", "1")
if llmobs_env_var:
from datadog_lambda.api import init_api
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice use of lazy loading here!

Copy link
Contributor

@astuyve astuyve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good to me, a few things to keep in mind:

  • if LLMobs ships data agentlessly via nonblocking thread or async runtime, you should expect to drop or miss data as the CPU is throttled and resumed between invocations. You may want to check this by performing a few invokes, then waiting several minutes. You can kind of handle this by aggressively retrying, but it's not perfect.
  • Because you're doing this in the main function process and not the extension, data will be lost if it's not flushed before the sandbox shuts down – as the main process is not executed on the shutdown event (but extensions are).
  • You should also expect to lose data in the event of a timeout, OOM, or other unhandled exceptions which cause the python process to re-initialize.

I'm not a codeowner of this project any longer, so I'll let that team chime in.

Copy link
Contributor

@duncanista duncanista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

After pairing on this, we discovered the impact in overhead for LLMObs customers.
The ballpark overhead increase for this use case would be 200ms (datadog.api import and secrets manager call).

So approximately ~800ms, as opposed to ~550ms.

LLMObs adds ~60ms during init.

@sabrenner sabrenner requested a review from a team as a code owner March 10, 2025 21:06
@sabrenner
Copy link
Collaborator Author

Thanks for the callouts @astuyve!

if LLMobs ships data agentlessly via nonblocking thread or async runtime, you should expect to drop or miss data as the CPU is throttled and resumed between invocations. You may want to check this by performing a few invokes, then waiting several minutes. You can kind of handle this by aggressively retrying, but it's not perfect.

Our LLMObs writer just re-uses the periodic HTTP writer service in dd-trace-py, which I think uses a synchronous blocking http request. It does run in a thread I think ( I haven't looked too deeply into the ddtrace internals), but I'll keep a note of this to see if it becomes an issue. I don't think our customers using LLMObs in serverless environments have too large of use cases to be hitting some kind of throttling. I'll run the tests, though, and follow up if I can make any improvements!

Because you're doing this in the main function process and not the extension, data will be lost if it's not flushed before the sandbox shuts down – as the main process is not executed on the shutdown event (but extensions are).

We force flush LLMObs in the datadog_lambda.wrapper._LambdaDecorator._after function, so I think we should be good in this case. This was something that got us in the past so we made sure to have that there 👍

You should also expect to lose data in the event of a timeout, OOM, or other unhandled exceptions which cause the python process to re-initialize.

This is a good point, which I think falls outside of this PR. I'll make a note to follow up on investigating this!

@sabrenner sabrenner merged commit 6e2e5d5 into main Mar 10, 2025
60 checks passed
@sabrenner sabrenner deleted the sabrenner/force-llmobs-agentless branch March 10, 2025 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants