
Commit fa52fd6

Merge pull request scribd#71 from houqp/datadog-spark
splitting custom metrics and streaming metrics into two sections
2 parents: eb96b5e + 8e05952


_posts/2020-09-15-integrating-databricks-and-datadog.md

Lines changed: 68 additions & 46 deletions
@@ -25,21 +25,8 @@ infrastructure.

## Configuring the Databricks cluster

-When creating the cluster in Databricks, we use the following init script-based
-configuration to set up the Datadog agent. It is also likely possible to set this
-up via [customized containers with Databricks Container
-Services](https://docs.databricks.com/clusters/custom-containers.html) but the
-`databricks` runtime images don't get updated as frequently as required for our
-purposes.
-
-* Add the cluster init script to set up Datadog below
-* Set the following environment variables for the cluster:
-  * `ENVIRONMENT=development/staging/production`
-  * `APP_NAME=your_spark_app_name`
-  * `DATADOG_API_KEY=KEY`
-
-All your Datadog metrics will be automatically tagged with `env` and `spark_app` tags.
-
+When creating a cluster in Databricks, we set up and configure the Datadog
+agent with the following init script on the driver node:

```bash
#!/bin/bash
@@ -50,11 +37,12 @@ All your Datadog metrics will be automatically tagged with `env` and `spark_app`
# * ENVIRONMENT
# * APP_NAME

-echo "Setting up metrics for spark application: ${APP_NAME}"
echo "Running on the driver? $DB_IS_DRIVER"
-echo "Driver ip: $DB_DRIVER_IP"

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
+  echo "Setting up metrics for spark application: ${APP_NAME}"
+  echo "Driver ip: $DB_DRIVER_IP"
+
  cat << EOF >> /home/ubuntu/databricks/spark/conf/metrics.properties
*.sink.statsd.host=${DB_DRIVER_IP}
EOF
@@ -65,7 +53,6 @@ EOF
DD_HOST_TAGS="[\"env:${ENVIRONMENT}\", \"spark_app:${APP_NAME}\"]" \
bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/7.22.0/cmd/agent/install_script.sh)"

-
cat << EOF >> /etc/datadog-agent/datadog.yaml
use_dogstatsd: true
# bind on all interfaces so it's accessible from executors
@@ -84,25 +71,31 @@ EOF
fi
```

-Once the cluster has been launched with the appropriate Datadog agent support,
-we must then integrate a Statsd client into the Spark app itself.
+The cluster also needs to be launched with the following environment variables
+in order to configure the integration:

-### Instrumenting Spark
+* `ENVIRONMENT=development/staging/production`
+* `APP_NAME=your_spark_app_name`
+* `DATADOG_API_KEY=KEY`

-Integrating Statsd in Spark is _very_ simple, but for consistency we use a
-variant of the `Datadog` class listed below. Additionally, for Spark Streaming applications,
-the `Datadog` class also comes with a helper method that you can use to forward
-all the streaming progress metrics into Datadog:

-```scala
-datadog.collectStreamsMetrics
-```
+Once the cluster has been fully configured with the above init script, you can
+then send metrics to Datadog from Spark through the statsd port exposed by the
+agent. All your Datadog metrics will be automatically tagged with `env` and
+`spark_app` tags.
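
To make that mechanism concrete, here is a minimal sketch of emitting a metric straight to the agent's DogStatsD endpoint from the driver. It assumes the `com.timgroup.statsd` client that the helper class later in the post wraps, plus the agent's default DogStatsD port `8125`; the object, prefix, and metric names are placeholders:

```scala
import com.timgroup.statsd.NonBlockingStatsDClient
import org.apache.spark.sql.SparkSession

object MetricsSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    // The init script installs the agent on the driver and binds DogStatsD on
    // all interfaces, so resolve the driver address from the Spark conf.
    // 8125 is the DogStatsD default port; adjust if the agent is configured otherwise.
    val driverHost = spark.sparkContext.getConf.get("spark.driver.host")
    val statsd = new NonBlockingStatsDClient("my_spark_app", driverHost, 8125)

    // Shows up in Datadog as my_spark_app.heartbeat, carrying the env and
    // spark_app host tags added by the agent.
    statsd.incrementCounter("heartbeat")

    spark.stop()
  }
}
```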

-By invoking this method, all streaming progress metrics will be tagged with `spark_app` and `label_name`
-tags. We use these streaming metrics to understand stream lag, issues with our
-batch sizes, and a number of other actionable metrics.
+In practice, you can set all of this up using DCS ([customized containers with
+Databricks
+Container Services](https://docs.databricks.com/clusters/custom-containers.html)) as well.
+But we decided against it in the end because we ran into many issues with DCS,
+including out-of-date base images and lack of support for built-in cluster
+metrics.

-And that’s it for the application setup!
+
+### Sending custom metrics from Spark
+
+Integrating Statsd with Spark is _very_ simple. To reduce boilerplate, we built
+an internal helper utility that wraps the `timgroup.statsd` library:


```scala
@@ -116,17 +109,6 @@ import scala.collection.JavaConverters._
 *
 * NOTE: this package relies on datadog agent to be installed and configured
 * properly on the driver node.
- *
- * == Example ==
- * implicit val spark = SparkSession.builder().getOrCreate()
- * val datadog = new Datadog(AppName)
- * // automatically forward spark streaming metrics to datadog
- * datadog.collectStreamsMetrics
- *
- * // you can use `datadog.statsdcli()` to create statsd clients from both driver
- * // and executors to emit custom metrics
- * val statsd = datadog.statsdcli()
- * statsd.count(s"${AppName}.foo_counter", 100)
 */
class Datadog(val appName: String)(implicit spark: SparkSession) extends Serializable {
  val driverHost: String = spark.sparkContext.getConf
@@ -165,9 +147,49 @@ class Datadog(val appName: String)(implicit spark: SparkSession) extends Serializable {
}
```

-**Note:** There is a known issue for Spark applications that exit
-immediately after a metric has been emitted. We still have some work to do in
-order to properly flush metrics before the application exits.
+Initializing the helper class takes two lines of code:
+
+```scala
+implicit val spark = SparkSession.builder().getOrCreate()
+val datadog = new Datadog(AppName)
+```
+
+Then you can use `datadog.statsdcli()` to create statsd clients from within
+both the **driver** and **executors** to emit custom metrics:
+
+
+```scala
+val statsd = datadog.statsdcli()
+statsd.count(s"${AppName}.foo_counter", 100)
+```
+
+**Note:** The Datadog agent flushes metrics on a [preset
+interval](https://docs.datadoghq.com/developers/dogstatsd/data_aggregation/#how-is-aggregation-performed-with-the-dogstatsd-server)
+that can be configured from the init script. By default, it's 10 seconds. This
+means that if your Spark application, running in a job cluster, exits immediately
+after a metric has been sent to the Datadog agent, the agent won't have enough time
+to forward that metric to Datadog before the Databricks cluster shuts down. To
+address this issue, you need to put a manual sleep at the end of the Spark
+application so the Datadog agent has enough time to flush the newly ingested
+metrics.
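
A minimal sketch of that workaround, continuing the snippets above; the 15-second pause is an arbitrary buffer chosen to exceed the default 10-second flush interval:

```scala
// Emit the final metric for the run...
val statsd = datadog.statsdcli()
statsd.count(s"${AppName}.job_completed", 1)

// ...then give the local Datadog agent time to flush its aggregation
// buffer (10 seconds by default) before the job cluster is torn down.
Thread.sleep(15000)
spark.stop()
```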
+
+
+### Instrumenting a Spark streaming app
+
+Users of the Datadog helper class can also push all Spark streaming progress
+metrics to Datadog with one line of code:
+
+```scala
+datadog.collectStreamsMetrics
+```
+
+This method sets up a streaming query listener to collect streaming progress
+metrics and send them to the Datadog agent. All streaming progress metrics will
+be tagged with `spark_app` and `query_name` tags. We use these streaming
+metrics to monitor streaming lag, issues with our batch sizes, and a number
+of other actionable metrics.
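
The listener wiring itself is not shown in this hunk, but conceptually `collectStreamsMetrics` can be a thin `StreamingQueryListener` that forwards a few progress fields to statsd. A minimal sketch, assuming the `statsdcli()` helper above; the metric names are illustrative:

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Sketch of a listener that forwards streaming progress metrics to the
// local Datadog agent; metric and tag names here are illustrative.
class StreamingMetricsListener(appName: String, datadog: Datadog) extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    val statsd = datadog.statsdcli()
    val tags = Array(s"query_name:${p.name}")

    statsd.gauge(s"$appName.streaming.num_input_rows", p.numInputRows.toDouble, tags: _*)
    statsd.gauge(s"$appName.streaming.input_rows_per_second", p.inputRowsPerSecond, tags: _*)
    statsd.gauge(s"$appName.streaming.processed_rows_per_second", p.processedRowsPerSecond, tags: _*)
  }
}

// Registering it is what a helper like collectStreamsMetrics would do internally:
// spark.streams.addListener(new StreamingMetricsListener(AppName, datadog))
```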
+
+And that’s it for the application setup!

---
