
Commit 5dcd02d

Merge branch 'site' into fix-previous-versions
2 parents: b9bfd20 + 2556662

File tree

4 files changed: +87 −6 lines changed


_community_blog/vllm-joins-pytorch.md

+10
@@ -0,0 +1,10 @@
---
title: "vLLM Joins PyTorch Ecosystem: Easy, Fast, and Cheap LLM Serving for Everyone"
author: vLLM Team
ext_url: /blog/vllm-joins-pytorch/
date: Dec 9, 2024
---

We’re thrilled to announce that the [vLLM project](https://github.com/vllm-project/vllm) has become a PyTorch ecosystem project and joined the PyTorch ecosystem family!

Running large language models (LLMs) is both resource-intensive and complex, especially as these models scale to hundreds of billions of parameters. That’s where vLLM comes in — a high-throughput, memory-efficient inference and serving engine designed for LLMs.

_includes/podcast.html

-6
@@ -22,11 +22,5 @@ <h1>PyTorch Developer Podcast</h1>
       <a href="{{ site.external_urls.amazon }}">Subscribe Here</a>
     </div>
   </div>
-  <div class="col-md-3 podcast-card">
-    <div class="podcast-info-container">
-      <p class="podcast-title">Google</p>
-      <a href="{{ site.external_urls.google }}">Subscribe Here</a>
-    </div>
-  </div>
 </div>
</div>
+77
@@ -0,0 +1,77 @@
---
layout: blog_detail
title: "vLLM Joins PyTorch Ecosystem: Easy, Fast, and Cheap LLM Serving for Everyone"
author: vLLM Team
hidden: true
---

![vllm logo](/assets/images/vllm.png){:style="width:100%;display: block;max-width:400px; margin-left:auto; margin-right:auto;"}

We’re thrilled to announce that the [vLLM project](https://github.com/vllm-project/vllm) has become a PyTorch ecosystem project and joined the PyTorch ecosystem family!

For more information on what it means to be a PyTorch ecosystem project, see the [PyTorch Ecosystem Tools page](https://pytorch.org/ecosystem/).

Running large language models (LLMs) is both resource-intensive and complex, especially as these models scale to hundreds of billions of parameters. That’s where vLLM comes in — a high-throughput, memory-efficient inference and serving engine designed for LLMs.

Originally built around the innovative [PagedAttention algorithm](https://arxiv.org/abs/2309.06180), vLLM has grown into a comprehensive, state-of-the-art inference engine. A thriving community is also continuously adding new features and optimizations to vLLM, including pipeline parallelism, chunked prefill, speculative decoding, and disaggregated serving.

Since its release, vLLM has garnered significant attention, achieving over 31,000 GitHub stars—a testament to its popularity and thriving community. This milestone marks an exciting chapter for vLLM as we continue to empower developers and researchers with cutting-edge tools for efficient and scalable AI deployment. Welcome to the next era of LLM inference!

vLLM has always had a strong connection with the PyTorch project. It is deeply integrated into PyTorch, leveraging it as a unified interface to support a wide array of hardware backends. These include NVIDIA GPUs, AMD GPUs, Google Cloud TPUs, Intel GPUs, Intel CPUs, Intel Gaudi HPUs, and AWS Neuron, among others. This tight coupling with PyTorch ensures seamless compatibility and performance optimization across diverse hardware platforms.

Did you know you can experience the power of vLLM right from your phone? During this year’s Amazon Prime Day, vLLM played a crucial role in [delivering lightning-fast responses to millions of users](https://aws.amazon.com/cn/blogs/machine-learning/scaling-rufus-the-amazon-generative-ai-powered-conversational-shopping-assistant-with-over-80000-aws-inferentia-and-aws-trainium-chips-for-prime-day/). Across three regions, over 80,000 Trainium and Inferentia chips powered an average of 3 million tokens per minute, all while maintaining a P99 latency of less than 1 second for the first response. That means when customers opened the Amazon app and chatted with Rufus, they were seamlessly interacting with vLLM in action!

vLLM also collaborates closely with leading model vendors to ensure support for popular models, including tight integration with Meta Llama, Mistral, Qwen, and DeepSeek models, plus many others. One particularly memorable milestone was the [release of Llama 3.1 (405B)](https://ai.meta.com/blog/meta-llama-3-1/). As the launch partner, vLLM was the first to enable running this very large model, showcasing vLLM’s capability to handle the most complex and resource-intensive language models.

To install vLLM, simply run:

```
pip install vllm
```

vLLM is designed for both researchers and production-grade serving.

To run vLLM as an OpenAI API-compatible server, just use the Hugging Face model ID:

```
vllm serve meta-llama/Llama-3.1-8B
```
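
Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal sketch of querying it from Python with the `requests` library, assuming the server is listening on its default local address (`http://localhost:8000`) and exposing the OpenAI-style `/v1/completions` endpoint:

```
import requests

# Query the locally running vLLM server through its OpenAI-compatible REST API.
# Assumes the default port (8000) and the model launched with `vllm serve` above.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.8,
    },
)
print(response.json()["choices"][0]["text"])
```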

To run vLLM as a simple function:

```
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="meta-llama/Llama-3.1-8B")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
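
For quick experiments, generation behavior can be tuned through `SamplingParams`. The sketch below builds on the example above; `max_tokens` and `stop` are standard `SamplingParams` fields, though exact defaults may differ between vLLM releases:

```
from vllm import LLM, SamplingParams

# Cap the output length and stop generation at a blank line.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64, stop=["\n\n"])

llm = LLM(model="meta-llama/Llama-3.1-8B")
outputs = llm.generate(["The future of AI is"], sampling_params)
print(outputs[0].outputs[0].text)
```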

Open-source innovation is part of vLLM’s DNA. Born out of a Berkeley academic project, it follows the legacy of other pioneering open-source initiatives such as BSD, which revolutionized operating systems in the 1980s. Other innovations from the same organization include [Apache Spark](https://github.com/apache/spark) and [Ray](https://github.com/ray-project/ray), now the standard for big data and AI systems. In the Gen AI era, vLLM serves as a platform dedicated to democratizing AI inference.

The vLLM team remains steadfast in its mission to keep the project “of the community, by the community, and for the community.” Collaboration and inclusivity lie at the heart of everything we do.

If you have collaboration requests or inquiries, feel free to reach out at [[email protected]](mailto:[email protected]). To join the active and growing vLLM community, explore our [GitHub repository](https://github.com/vllm-project/vllm) or connect with us on the [vLLM Slack](https://slack.vllm.ai). Together, we can push the boundaries of AI innovation and make it accessible to all.

assets/images/vllm.png

32 KB
