@@ -25,21 +25,8 @@ infrastructure.
## Configuring the Databricks cluster

- When creating the cluster in Databricks, we use the following init script-based
- configuration to set up the Datadog agent. It is also likely possible to set this
- up via [customized containers with Databricks Container
- Services](https://docs.databricks.com/clusters/custom-containers.html) but the
- `databricks` runtime images don't get updated as frequently as required for our
- purposes.
-
- * Add cluster init script to setup datadog below
- * Set following environment variables for the cluster:
-   * `ENVIRONMENT=development/staging/production`
-   * `APP_NAME=your_spark_app_name`
-   * `DATADOG_API_KEY=KEY`
-
- All your Datadog metrics will be automatically tagged with `env` and `spark_app` tags.
-
+ When creating a cluster in Databricks, we set up and configure the Datadog
+ agent with the following init script on the driver node:

```bash
#!/bin/bash
@@ -50,11 +37,12 @@ All your Datadog metrics will be automatically tagged with `env` and `spark_app`
# * ENVIRONMENT
# * APP_NAME

- echo "Setting up metrics for spark application: ${APP_NAME}"
echo "Running on the driver? $DB_IS_DRIVER"
- echo "Driver ip: $DB_DRIVER_IP"

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
+ echo "Setting up metrics for spark application: ${APP_NAME}"
+ echo "Driver ip: $DB_DRIVER_IP"
+
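# Point Spark's statsd metrics sink at the driver host, where the
# Datadog agent (installed below) listens for dogstatsd packets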
cat << EOF >> /home/ubuntu/databricks/spark/conf/metrics.properties
*.sink.statsd.host=${DB_DRIVER_IP}
EOF
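# Install the Datadog agent, tagging everything it reports with env and spark_app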
DD_HOST_TAGS="[\"env:${ENVIRONMENT}\", \"spark_app:${APP_NAME}\"]" \
bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/7.22.0/cmd/agent/install_script.sh)"

-
cat << EOF >> /etc/datadog-agent/datadog.yaml
use_dogstatsd: true
# bind on all interfaces so it's accessible from executors
dogstatsd_non_local_traffic: true
EOF
fi
```

- Once the cluster has been launched with the appropriate Datadog agent support,
- we must then integrate a Statsd client into the Spark app itself.
+ The cluster also needs to be launched with the following environment variables
+ in order to configure the integration:

- ### Instrumenting Spark
+ * `ENVIRONMENT=development/staging/production`
+ * `APP_NAME=your_spark_app_name`
+ * `DATADOG_API_KEY=KEY`
- Integrating Statsd in Spark is _very_ simple, but for consistency we use a
- variant of the `Datadog` class listed below. Additionally, for Spark Streaming applications,
- the `Datadog` class also comes with a helper method that you can use to forward
- all the streaming progress metrics into Datadog:

- ```scala
- datadog.collectStreamsMetrics
- ```
+ Once the cluster has been fully configured with the above init script, you can
+ then send metrics to Datadog from Spark through the statsd port exposed by the
+ agent. All your Datadog metrics will be automatically tagged with `env` and
+ `spark_app` tags.

- By invoking this method, all streaming progress metrics will be tagged with `spark_app` and `label_name`
- tags. We use these streaming metrics to understand stream lag, issues with our
- batch sizes, and a number of other actionable metrics.
+ In practice, you can set up all of this using DCS ([customized containers with
+ Databricks Container Services](https://docs.databricks.com/clusters/custom-containers.html)) as well.
+ But we decided against it in the end because we ran into many issues with DCS,
+ including out-of-date base images and lack of support for built-in cluster
+ metrics.
- And that’s it for the application setup!
+
+ ### Sending custom metrics from Spark
+
+ Integrating Statsd with Spark is _very_ simple. To reduce boilerplate, we built
+ an internal helper utility that wraps the `timgroup.statsd` library:

```scala
@@ -116,17 +109,6 @@ import scala.collection.JavaConverters._
*
* NOTE: this package relies on datadog agent to be installed and configured
* properly on the driver node.
- *
- * == Example ==
- * implicit val spark = SparkSession.builder().getOrCreate()
- * val datadog = new Datadog(AppName)
- * // automatically forward spark streaming metrics to datadog
- * datadog.collectStreamsMetrics
- *
- * // you can use `datadog.statsdcli()` to create statsd clients from both driver
- * // and executors to emit custom metrics
- * val statsd = datadog.statsdcli()
- * statsd.count(s"${AppName}.foo_counter", 100)
*/
class Datadog(val appName: String)(implicit spark: SparkSession) extends Serializable {
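  // The Datadog agent runs on the driver node (see init script above), so
  // statsd clients created by this helper address the driver host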
  val driverHost: String = spark.sparkContext.getConf
@@ -165,9 +147,49 @@ class Datadog(val appName: String)(implicit spark: SparkSession) extends Seriali
}
```

- **Note:** There is a known issue for Spark applications that exit
- immediately after a metric has been emitted. We still have some work to do in
- order to properly flush metrics before the application exits.
+ Initializing the helper class takes two lines of code:
+
+ ```scala
+ implicit val spark = SparkSession.builder().getOrCreate()
+ val datadog = new Datadog(AppName)
+ ```
+
+ Then you can use `datadog.statsdcli()` to create statsd clients from within
+ both **driver** and **executors** to emit custom metrics:
+
+ ```scala
+ val statsd = datadog.statsdcli()
+ statsd.count(s"${AppName}.foo_counter", 100)
+ ```
+
+ **Note:** Datadog agent flushes metrics on a [preset
+ interval](https://docs.datadoghq.com/developers/dogstatsd/data_aggregation/#how-is-aggregation-performed-with-the-dogstatsd-server)
+ that can be configured from the init script. By default, it's 10 seconds. This
+ means that if your Spark application, running in a job cluster, exits immediately
+ after a metric has been sent to the Datadog agent, the agent won't have enough
+ time to forward that metric to Datadog before the Databricks cluster shuts down.
+ To address this issue, you need to put a manual sleep at the end of the Spark
+ application so the Datadog agent has enough time to flush the newly ingested
+ metrics.
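+
+ For example, the tail of a job might look like the following sketch (the
+ metric name and the 15-second figure are illustrative; the only requirement
+ is to sleep longer than the agent's flush interval):
+
+ ```scala
+ // Emit the final metrics for the run...
+ statsd.count(s"${AppName}.job_completed", 1)
+ // ...then give the agent at least one flush interval (10s by default)
+ // to forward buffered metrics before the job cluster is torn down.
+ Thread.sleep(15000)
+ ```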
+
+ ### Instrumenting Spark streaming app
+
+ Users of the Datadog helper class can also push all Spark streaming progress
+ metrics to Datadog with one line of code:
+
+ ```scala
+ datadog.collectStreamsMetrics
+ ```
+
+ This method sets up a streaming query listener to collect streaming progress
+ metrics and send them to the Datadog agent. All streaming progress metrics will
+ be tagged with `spark_app` and `query_name` tags. We use these streaming
+ metrics to monitor streaming lag, issues with our batch sizes, and a number
+ of other actionable metrics.
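+
+ Conceptually, the method registers a `StreamingQueryListener` that forwards
+ each progress update to statsd. The sketch below is a simplified
+ approximation of our helper, not its exact code; the metric names are
+ illustrative, and it assumes your streaming queries are named:
+
+ ```scala
+ import org.apache.spark.sql.streaming.StreamingQueryListener
+ import org.apache.spark.sql.streaming.StreamingQueryListener._
+
+ def collectStreamsMetrics: Unit = {
+   spark.streams.addListener(new StreamingQueryListener {
+     override def onQueryStarted(event: QueryStartedEvent): Unit = ()
+     override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
+     override def onQueryProgress(event: QueryProgressEvent): Unit = {
+       val statsd = statsdcli()
+       val progress = event.progress
+       // streaming metrics carry the query_name tag; spark_app is added
+       // by the agent itself through DD_HOST_TAGS in the init script
+       val tags = s"query_name:${progress.name}"
+       statsd.gauge(s"$appName.streaming.input_rows_per_second", progress.inputRowsPerSecond, tags)
+       statsd.gauge(s"$appName.streaming.processed_rows_per_second", progress.processedRowsPerSecond, tags)
+       statsd.count(s"$appName.streaming.num_input_rows", progress.numInputRows, tags)
+     }
+   })
+ }
+ ```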
+
+ And that’s it for the application setup!

---