
LLM Serving Stack Simulator

Setup

You can set up the simulator by installing its Python dependencies. We recommend starting with a fresh Python environment.

# Create and activate new Python environment
conda create -n sim python=3.11
conda activate sim

# Install dependencies
pip install -r requirements.txt

Inputs and Outputs

SplitwiseSim takes a hierarchical set of YAML configuration files as input and produces several CSV files as output. It uses Hydra for configuration management. You can learn more about configuration management from the Hydra docs.

The top-level configuration file for SplitwiseSim is config.yaml, which points to lower-level configurations specified by other files in the configs/ directory. Specifically, config.yaml captures the following key components:

  • cluster: the provisioned server SKUs in the cluster, along with their respective counts.
  • trace: request trace that specifies the set of requests that arrive into the cluster.
  • router: the cluster-level router that routes incoming requests to application-level schedulers; currently a no-op.
  • arbiter: the cluster-level arbiter that manages compute resources between applications to support autoscaling; currently a no-op.
  • application: the logical endpoint that the requests target, which specifies the model and the set of instances on which the request runs; currently, we support only one application.
  • model_repo: the set of models (LLMs) available to run in the cluster; used for dynamic model instantiation.
  • orchestrator_repo: the set of application resource orchestrators (i.e., schedulers and allocators) in the cluster; used for dynamic application management.
  • hardware_repo: the set of available SKUs that can be provisioned in the cluster; used for dynamic server instantiation.
  • performance_model: an analytical model that helps estimate request runtimes with different batch, model, and hardware configurations.
  • start_state: starting state for the cluster, which helps simplify evaluation.

Several other aspects can be configured; please see config.yaml for details.
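
Because configuration is managed by Hydra, each of these components can also be overridden from the command line instead of editing config.yaml. A minimal sketch, assuming the group and key names shown (cluster, trace.filename, debug) match those defined under configs/:

# Illustrative Hydra overrides; substitute the actual group and key
# names from configs/ (these are assumptions, not verified defaults)
python run.py cluster=<cluster_config> trace.filename=ES_26 debug=True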

SplitwiseSim generates the following key outputs:

  • Summary of application-level metrics (summary.csv)
  • Per-request metrics for each completed request, per application (detailed/{application_id}.csv)
  • Request node-level metrics (request_nodes.csv)
  • Instance-level execution metrics (in instances/, with debug enabled)

We provide various utility functions to process outputs, as shown in notebooks/example.ipynb and notebooks/plots.ipynb.
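
To open these notebooks, you can use Jupyter (a sketch, assuming Jupyter is not already installed in the sim environment):

# Install Jupyter into the active environment and open the example notebook
pip install jupyter
jupyter notebook notebooks/example.ipynb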

How to run?

Simply modify config.yaml as needed and execute python run.py.

How to run experiments with separate scaling between prod/dev?

  • Set the controller to us3-dp in config.yaml.
  • Use the annotated trace ES_26_dp.csv in configs/trace/enterprise_sydney.yaml (an example invocation follows the list).
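
A minimal sketch of the equivalent run command, assuming controller can be overridden as a Hydra config group and that trace.filename accepts the annotated trace name (both are assumptions; adjust to your configs/):

# Hypothetical command-line overrides for the prod/dev scaling setup;
# verify the group and key names against configs/ before use
python run.py controller=us3-dp trace.filename=ES_26_dp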

How to configure other knobs?

The following knobs in config.yaml allow you to change the execution configuration; an example override combining them follows the list.

  • feed_async: True/False to enable/disable the insertion of async requests whenever memory utilization falls below 0.5.
  • feed_async_granularity: the number of async requests to insert at a time.
  • scaling_level: 0 for no scaling, 1 for scaling from/to spot only, and 2 for inter-model scaling along with spot donations.
  • scaling_interval: the number of seconds to wait between two scaling events per model endpoint. Use -1 to disable this knob, i.e., place no restriction on how frequently scaling events occur.
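
A minimal sketch that combines these knobs as Hydra command-line overrides (the values shown are illustrative, not recommendations):

# Enable async feeding with 4 requests inserted at a time, allow
# spot-only scaling, and wait at least 300 s between scaling events
python run.py feed_async=True feed_async_granularity=4 scaling_level=1 scaling_interval=300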

Long-term scaling

All short-term scaling scripts should still run as they are!

Scaling on a 1-hour window

Run with

python3 run_kunal.py trace.filename=ES_26 \
    short_term_scaling=False \
    long_term_scaling=True \
    global_arbiter.arima_traces=$PWD/traces/forecasts/ \
    global_arbiter.post_processing_strategy=<STRATEGY>

where STRATEGY is one of immediate, delay_changes, keep_maximum_instances, or keep_minimum_instances.

Reactive Scaling, Proactive Guidance

Run one of the following, depending on the per-region arbiter:

python3 run.py trace.filename=final_data_day_1 \
    short_term_scaling=True \
    long_term_scaling=True \
    global_arbiter.arima_traces=$PWD/traces/forecasts/ \
    controller.regions.0.arbiter=global_arbiter_ARIMA_checking \
    controller.regions.1.arbiter=global_arbiter_ARIMA_checking \
    controller.regions.2.arbiter=global_arbiter_ARIMA_checking \
    global_arbiter.arima_aware_arbiter=True

python3 run.py trace.filename=final_data_day_1 \
    short_term_scaling=True \
    long_term_scaling=True \
    global_arbiter.arima_traces=$PWD/traces/forecasts/ \
    controller.regions.0.arbiter=global_arbiter_memory_utilization \
    controller.regions.1.arbiter=global_arbiter_memory_utilization \
    controller.regions.2.arbiter=global_arbiter_memory_utilization \
    global_arbiter.arima_aware_arbiter=True

python3 run.py trace.filename=final_data_day_1 \
    short_term_scaling=True \
    long_term_scaling=True \
    global_arbiter.arima_traces=$PWD/traces/forecasts/ \
    controller.regions.0.arbiter=global_arbiter_short_term_scaling \
    controller.regions.1.arbiter=global_arbiter_short_term_scaling \
    controller.regions.2.arbiter=global_arbiter_short_term_scaling \
    global_arbiter.arima_aware_arbiter=True
