
parallel batching of large join requests #994


Merged: 7 commits merged into main on Jun 10, 2025

Conversation

@hzding621 (Collaborator) commented May 16, 2025

Summary

Reduce the batch size for really large batched join requests. There is sequential processing in fetchJoin's kvstoreFuture.map{...}, which becomes the bottleneck when the batch size is very large.
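
For intuition, here is a minimal, self-contained sketch of the chunking idea (not the PR's actual code; Request, Response, and doFetchJoin are simplified stand-ins for the real Fetcher types and internals). Splitting one large request batch into fixed-size chunks lets the per-chunk sequential map work overlap across chunks:

import scala.concurrent.{ExecutionContext, Future}

// Simplified stand-ins for the real Fetcher request/response types.
case class Request(name: String, keys: Map[String, AnyRef])
case class Response(request: Request, values: Map[String, AnyRef])

// Sketch: fan a large batch out as fixed-size chunks and flatten the results,
// preserving the original request order.
def fetchJoinChunkedSketch(requests: Seq[Request], chunkSize: Int)(
    doFetchJoin: Seq[Request] => Future[Seq[Response]])(
    implicit ec: ExecutionContext): Future[Seq[Response]] = {
  val chunks = requests.grouped(chunkSize).toSeq
  val chunkFutures = chunks.map(doFetchJoin)
  Future.sequence(chunkFutures).map(_.flatten)
}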

Why / Goal

latency reduction

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

Integration testing:

  • Observed p99 latency reduction from ~600 ms to ~200 ms (traffic: batch size ~150, ~17 join parts)

Checklist

  • Documentation update

Reviewers

@pengyu-hou @airbnb/airbnb-chronon-maintainers

@@ -234,6 +234,18 @@ class Fetcher(val kvStore: KVStore,
}

override def fetchJoin(requests: scala.collection.Seq[Request],
@nikhil-zlai (Collaborator) commented May 27, 2025:

The behavior becomes backwards-incompatible; shall we introduce a new method fetchJoinChunked(join, keys, chunkSize) instead?

Also, it is better to be explicit with these.

@hzding621 (Collaborator, Author) replied:

I'm planning to make it a configurable parameter at the Fetcher level.

@hzding621 (Collaborator, Author) commented Jun 5, 2025:

@nikhil-zlai why not make the chunking behavior the new default?

@hzding621 (Collaborator, Author) replied:

Updated to keep the existing behavior, PTAL.

@hzding621 force-pushed the haozhen--par-map-3 branch from fd097f5 to e730f3a on June 5, 2025 23:23
@hzding621 marked this pull request as ready for review on June 5, 2025 23:24
@@ -94,7 +94,8 @@ class Fetcher(val kvStore: KVStore,
callerName: String = null,
flagStore: FlagStore = null,
disableErrorThrows: Boolean = false,
- executionContextOverride: ExecutionContext = null)
+ executionContextOverride: ExecutionContext = null,
+ joinFetchParallelChunkSize: Option[Int] = Some(32))
A collaborator commented:

If we are going to keep this and not split it out like Nikhil mentioned, it might be good to keep the default behavior as before (un-chunked) and allow clients to override if they want. WDYT?

@hzding621 (Collaborator, Author) replied:

Good point. Updated!
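
For context, a minimal sketch (not the PR's actual code, reusing the Request/Response stand-ins from the earlier sketch) of how an Option-valued chunk size could gate between the chunked path and the existing un-chunked default:

import scala.concurrent.{ExecutionContext, Future}

// Sketch only: an Option[Int] chunk size selects chunked vs. legacy behavior.
def fetchJoinMaybeChunked(requests: Seq[Request],
                          joinFetchParallelChunkSize: Option[Int])(
    doFetchJoin: Seq[Request] => Future[Seq[Response]])(
    implicit ec: ExecutionContext): Future[Seq[Response]] =
  joinFetchParallelChunkSize match {
    case Some(chunkSize) =>
      // Chunked path: fan out fixed-size chunks and flatten the results.
      Future.sequence(requests.grouped(chunkSize).toSeq.map(doFetchJoin)).map(_.flatten)
    case None =>
      // Default (None): preserve the existing single-batch behavior.
      doFetchJoin(requests)
  }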

val batches = requests.grouped(joinFetchParallelChunkSize.get).toSeq
val batchFutures: Seq[Future[Seq[Response]]] =
  batches.map(batch => doFetchJoin(batch, joinConf))
Future.sequence(batchFutures).map(_.flatten)
A collaborator commented:

An idea here for the chunking might be to return individual chunked Futures rather than sequencing them. With sequencing you have to wait for all of them to come back; with individual futures you get behavior akin to streaming responses, so the caller can start processing / acting on them as they arrive.

A collaborator replied:

Valid point. This is what GPT said:

val batches = requests.grouped(joinFetchParallelChunkSize.get).toSeq

// Return individual futures instead of a single combined future
val responseFutures: Seq[Future[Seq[Response]]] = batches.map { batch =>
  doFetchJoin(batch, joinConf)
}

// Don't use Future.sequence - instead, process each future individually
// (assumes an implicit ExecutionContext is in scope for Future#foreach)
responseFutures.foreach { future =>
  future.foreach { responses =>
    // Process each batch of responses as soon as it arrives
    processResponses(responses)  // placeholder for the caller's processing function
  }
}

@hzding621 (Collaborator, Author) replied:

@piyush-zlai @pengyu-hou thanks for the suggestion! Updated and added a separate entry point for a chunked API in both the Scala and Java fetchers.

@hzding621 (Collaborator, Author) commented Jun 6, 2025:

So I included two flows:

  • The existing fetchJoin will use the Fetcher-level configuration (defaulting to un-chunked). This is useful for clients who'd like to mostly keep the existing behavior but still be able to enable chunking (only at a global level). I think Airbnb users will most likely use this flow.
  • The new fetchJoinChunked is for users who'd like fine-grained control. This API returns a Seq<Future<Seq<Response>>> for streaming-like behavior (see the consumption sketch after this list).
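
A minimal consumption sketch for the chunked flow, assuming a hypothetical fetchJoinChunked(requests, chunkSize) that returns Seq[Future[Seq[Response]]] as described above (the name, signature, and Request/Response stand-ins are illustrative, not the PR's exact API):

import scala.concurrent.{ExecutionContext, Future}

// Illustrative only: handle each chunk as soon as it completes, instead of
// waiting on Future.sequence over the whole batch.
def consumeChunked(requests: Seq[Request], chunkSize: Int)(
    fetchJoinChunked: (Seq[Request], Int) => Seq[Future[Seq[Response]]])(
    implicit ec: ExecutionContext): Unit = {
  val chunkFutures = fetchJoinChunked(requests, chunkSize)
  chunkFutures.foreach { chunkFuture =>
    chunkFuture.foreach { responses =>
      // Streaming-like behavior: process this chunk's responses immediately.
      responses.foreach(r => println(s"got response for ${r.request.name}"))
    }
  }
}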

@@ -45,7 +45,8 @@ class FetcherBase(kvStore: KVStore,
debug: Boolean = false,
flagStore: FlagStore = null,
disableErrorThrows: Boolean = false,
- executionContextOverride: ExecutionContext = null)
+ executionContextOverride: ExecutionContext = null,
+ joinFetchParallelChunkSize: Option[Int] = Some(32))
A collaborator commented:

Should we use something like Option(System.getProperty("ai.chronon.fetcher.join_fetch_parallel_chunk_size")) to pass the default chunk size?
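
A quick sketch of that suggestion, using the property name above (the parsing itself is illustrative, not part of the PR):

import scala.util.Try

// Read an optional default chunk size from a JVM system property;
// an absent or malformed value falls back to None (un-chunked behavior).
val defaultJoinFetchChunkSize: Option[Int] =
  Option(System.getProperty("ai.chronon.fetcher.join_fetch_parallel_chunk_size"))
    .flatMap(s => Try(s.toInt).toOption)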

@hzding621 merged commit 44cb75c into main on Jun 10, 2025; 7 checks passed.
@hzding621 deleted the haozhen--par-map-3 branch on June 10, 2025 00:18.