
llama-model : add dots.llm1 architecture support (#14044) #14118


Merged · 1 commit merged into ggml-org:master on Jun 15, 2025

Conversation

@Noeda (Contributor) commented Jun 11, 2025

Adds support for the "dots.llm1" architecture. I decided to shorten that to dots1/DOTS1 in the code.

Tracking issue: #14044


The only models that currently exist that use this architecture are dots.llm1.inst and dots.llm1.base:

* https://huggingface.co/rednote-hilab/dots.llm1.inst
* https://huggingface.co/rednote-hilab/dots.llm1.base

There is also a paper: https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf (the link is also on their Hugging Face page).

And RedNote appears to have a GitHub page for this model as well: https://github.com/rednote-hilab/dots.llm1

The architecture mixes DeepseekV2-style MoE code with Qwen3-style attention:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

The model is a 32k-context MoE model with 142B total parameters and 14B activated parameters. It has its own new chat template and special tokens.
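To make the "Qwen3 attention + DeepseekV2-style MoE" mix a bit more concrete, here is a minimal, hypothetical PyTorch sketch of the MoE half only (routed top-k experts plus an always-active shared expert). It is an illustration with made-up sizes, not the actual transformers or llama.cpp code, and the Qwen3-style attention side is omitted; see modular_dots1.py for the real layer.

```python
# Toy sketch of a DeepSeek-style MoE FFN: a router picks top-k experts per
# token, and an always-active "shared" expert is added on top. Sizes are toy
# values, not the real dots.llm1 hyperparameters.
import torch
import torch.nn as nn


def make_ffn(hidden: int, inter: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(hidden, inter, bias=False),
                         nn.SiLU(),
                         nn.Linear(inter, hidden, bias=False))


class ToyMoE(nn.Module):
    def __init__(self, hidden=64, n_experts=8, top_k=2, inter=128):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, n_experts, bias=False)  # router
        self.experts = nn.ModuleList([make_ffn(hidden, inter) for _ in range(n_experts)])
        self.shared = make_ffn(hidden, inter)                  # shared expert

    def forward(self, x):
        probs = self.gate(x).softmax(dim=-1)              # (batch, seq, n_experts)
        top_p, top_i = probs.topk(self.top_k, dim=-1)     # top-k routing per token
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize selected weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[..., k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += top_p[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out + self.shared(x)


x = torch.randn(2, 5, 64)
print(ToyMoE()(x).shape)  # torch.Size([2, 5, 64])
```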

I think this may be the lab's very first model; I see no other history from them and I had never heard of them before. The model itself seems fairly okay, with similar smarts to other recent local models of this size, but since my experience is purely anecdotal I don't dare make strong claims about whether it is good or not.


This PR has:

  1. The various _DOTS1 constants added wherever new architecture code goes.
  2. Dots1Model added to convert_hf_to_gguf.py to convert the models.
  3. The chat template added to llama-chat.cpp so llama-server can use it, following the Hugging Face transformers code (see the sketch after this list).
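Conceptually, the chat-template code wraps each turn in role-specific special tokens and leaves the assistant turn open for generation. The sketch below illustrates only that idea; the token strings are placeholders, not the actual dots.llm1 special tokens, which live in llama-chat.cpp and the tokenizer config.

```python
# Conceptual sketch of the chat-template logic. The token strings below are
# PLACEHOLDERS for illustration only, not the real dots.llm1 special tokens.
ROLE_TOKENS = {
    "system":    ("<|system_placeholder|>", "<|end_system_placeholder|>"),
    "user":      ("<|user_placeholder|>", "<|end_user_placeholder|>"),
    "assistant": ("<|assistant_placeholder|>", "<|end_assistant_placeholder|>"),
}


def format_chat(messages, add_generation_prompt=True):
    parts = []
    for msg in messages:
        begin, end = ROLE_TOKENS[msg["role"]]
        parts.append(f"{begin}{msg['content']}{end}")
    if add_generation_prompt:
        parts.append(ROLE_TOKENS["assistant"][0])  # open the response turn
    return "".join(parts)


print(format_chat([{"role": "user", "content": "who are you?"}]))
```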

So far I've tested it empirically, and with some perplexity tests. Nothing seems totally off.

Some examples of prompting here: #14044 (comment)

Perplexity tests, see this comment: #14118 (comment)

For reference I used RedNote team's PR to transformers: huggingface/transformers#38143

@Noeda (Contributor, Author) commented Jun 11, 2025

Forgot to mention: @ddh0 made some quants, although for now I think you should run them with --override-kv tokenizer.ggml.eos_token_id=int:151649, because when I checked the metadata they had the wrong EOS token (the source safetensors files presumably predate the upstream team's EOS token fix on the Hugging Face side):

https://huggingface.co/ddh0/dots.llm1.inst-GGUF-Q4_0-EXPERIMENTAL
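If you want to check which EOS token id a .gguf actually carries before deciding whether the override is needed, a rough sketch using the gguf Python package from llama.cpp's gguf-py is below. The filename is hypothetical and the reader API has changed a bit across versions, so treat the field access as approximate.

```python
# Hedged sketch: read tokenizer.ggml.eos_token_id out of a GGUF file with the
# gguf-py package shipped in the llama.cpp repo.
from gguf import GGUFReader

reader = GGUFReader("dots.llm1.inst-Q4_0.gguf")  # hypothetical local path
field = reader.get_field("tokenizer.ggml.eos_token_id")
if field is None:
    print("no EOS token id stored in the metadata")
else:
    # For a scalar field, the value lives in the last referenced data part.
    print("eos_token_id:", int(field.parts[field.data[-1]][0]))
```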

@github-actions github-actions bot added the python python script changes label Jun 11, 2025
@Noeda Noeda force-pushed the dots1_squished_squashy branch from 1c1517b to 16dc0f4 Compare June 11, 2025 06:16
@Noeda (Contributor, Author) commented Jun 11, 2025

Force-pushed a tiny fix for a linter error ^

@jacekpoplawski commented Jun 11, 2025

I was able to run it:

~/git/llama.cpp/build_2025.06.11_dots$ ./bin/llama-cli -ngl 50 -m /mnt/models3/dots.llm1.inst-Q4_0.gguf --override-kv tokenizer.ggml.eos_token_id=int:151649 -p "who are you?" 2>/dev/null
You are a helpful assistant.who are you?I'm dots, your AI assistant created by rednote-hilab! 🌟I'm here to help you with all kinds of questions—whether you need information, advice, or just someone to chat with. I can analyze documents, summarize text, explain concepts, and even brainstorm ideas. How can I assist you today? 😊
~/git/llama.cpp/build_2025.06.11_dots$ ./bin/llama-cli -ngl 50 -m /mnt/models3/dots.llm1.inst-Q4_0.gguf --override-kv tokenizer.ggml.eos_token_id=int:151649 -p "list 10 AI companies" 2>/dev/null
You are a helpful assistant.list 10 AI companiesHere’s a list of **10 notable AI companies** (as of mid-2**2024**), spanning both well-established giants and innovative startups:

### **1. Big Tech (AI Leaders)**
1. **Google (Alphabet)** – DeepMind, Google AI, TensorFlow
2. **Microsoft** – Azure AI, Copilot, OpenAI partnership
3. **Meta (Facebook)** – FAIR, Llama models, AI research
4. **Amazon** – AWS AI/ML, Alexa, Bedrock (foundation models)
5. **Apple** – Core ML, Siri advancements, AI in AR/VR

### **2. OpenAI & AI Pioneers**
6. **OpenAI** – ChatGPT, GPT-4, DALL·E
7. **Anthropic** – Claude AI, safety-focused AI

### **3. AI Infrastructure & Tools**
8. **NVIDIA** – GPUs for AI, Omniverse, DGX systems
9. **Hugging Face** – Leader in open-source ML models (Transformers library)

### **4. Emerging/Vertical AI Startups**
10. **Runway** – GenAI for video/creative tools (used in Hollywood)

### **Honorable Mentions:**
- **Tesla** (Autopilot, Dojo supercomputing)
- **DeepMind** (separate from Google, but now integrated)
- **Cohere** (enterprise NLP)
- **Inflection AI** (Pi chatbot)

Would you like a focus on a specific niche (e.g., healthcare AI, autonomous systems)?
load_tensors: offloaded 50/63 layers to GPU
load_tensors:        CUDA0 model buffer size = 21158.12 MiB
load_tensors:        CUDA1 model buffer size = 21158.12 MiB
load_tensors:        CUDA2 model buffer size =  9956.76 MiB
load_tensors:        CUDA3 model buffer size =  9956.76 MiB
load_tensors:   CPU_Mapped model buffer size = 14620.13 MiB

(...)

llama_perf_sampler_print:    sampling time =      25.94 ms /   306 runs   (    0.08 ms per token, 11796.45 tokens per second)
llama_perf_context_print:        load time =   17196.71 ms
llama_perf_context_print: prompt eval time =     521.42 ms /    13 tokens (   40.11 ms per token,    24.93 tokens per second)
llama_perf_context_print:        eval time =   17150.21 ms /   292 runs   (   58.73 ms per token,    17.03 tokens per second)
llama_perf_context_print:       total time =   18636.09 ms /   305 tokens

@DocShotgun (Contributor) commented Jun 12, 2025

Doing some local testing at the moment on ddh0's Q4_0 quant. Text seems coherent so far, and I'm getting decent speed on an RTX PRO 6000 96 GB with 32k context allocated, Q8_0 cache, and flash attention:

load_tensors:        CUDA0 model buffer size = 76515.78 MiB
load_tensors:   CPU_Mapped model buffer size =   334.12 MiB
...
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size = 16864.00 MiB
llama_kv_cache_unified: size = 16864.00 MiB ( 32768 cells,  62 layers,  1 seqs), K (q8_0): 8432.00 MiB, V (q8_0): 8432.00 MiB
llama_context:      CUDA0 compute buffer size =   321.00 MiB
llama_context:  CUDA_Host compute buffer size =    72.01 MiB
...
prompt eval time =     241.08 ms /   416 tokens (    0.58 ms per token,  1725.58 tokens per second)
       eval time =     484.00 ms /    38 tokens (   12.74 ms per token,    78.51 tokens per second)
      total time =     725.08 ms /   454 tokens

I noticed the occasional Chinese character appearing mid-text or occasional typo/nonsense word with sampling settings of temp 1 and min-p 0.1. I'm not sure whether to attribute this to the model itself versus the q4_0 quantization versus the q8_0 cache versus the arch implementation. For example, the model invented the word "smirpsilon" with the tokens being sm + ir + psilon, and the logprobs look very strange at that position:
[screenshot: logprobs at that token position]
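One way to poke at positions like this against a running llama-server is to ask its native /completion endpoint for top token probabilities. A rough sketch follows; the prompt is illustrative and the response layout differs across server versions, so it just dumps the JSON.

```python
# Hedged sketch: query a locally running llama-server for next-token
# probability lists so odd positions (like the "smirpsilon" one) can be inspected.
import json

import requests

payload = {
    "prompt": "The model invented the word smir",  # illustrative prompt
    "n_predict": 4,                                 # generate a few tokens
    "n_probs": 5,                                   # top-5 probabilities per token
}
resp = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=120)
print(json.dumps(resp.json(), indent=2))
```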

@jukofyork (Collaborator) commented Jun 12, 2025

> Doing some local testing at the moment on ddh0's Q4_0 quant. [...] I noticed the occasional Chinese character appearing mid-text or occasional typo/nonsense word with sampling settings of temp 1 and min-p 0.1. [...] For example, the model invented the word "smirpsilon" with the tokens being sm + ir + psilon, and the logprobs look very strange at that position.

That does look pretty odd. When qwen-2 and other Chinese models do this, the Chinese word they insert usually makes sense when you translate it, but this just looks garbled.

Could it be an overflow, maybe? IIRC the qwen-2 architecture suffered really badly from overflows, both here in llama.cpp and for people trying to generate exllamav2 quants (usually the activations in the last couple of layers would grow beyond the range of FP16).
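For reference, the failure mode being described looks like this in isolation (a tiny illustration only, not evidence that dots1 actually overflows):

```python
# FP16 tops out around 65504, so large activations overflow to inf, while
# bfloat16 keeps FP32's exponent range at the cost of precision.
import torch

x = torch.tensor([60000.0, 70000.0])
print(x.to(torch.float16))   # the 70000.0 entry overflows to inf
print(x.to(torch.bfloat16))  # both stay finite (range preserved, precision reduced)
```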

@DocShotgun (Contributor) commented:
It would be interesting to see whether the vllm/transformers implementation has any issues like this. The logprobs make it look like the model is absolutely baffled at that token position, as none of the options that show up there are sane continuations of "smir" lol.

@jukofyork (Collaborator) commented Jun 12, 2025

@DocShotgun I'm jealous of the RTX PRO 6000 96 GB!

I've had a Max-Q version on order since March, but after Scan delayed it for the 4th time:

https://www.scan.co.uk/shop/computer-hardware/gpu-nvidia-workstation/nvidia-workstation-visualisation-graphics-cards

I cancelled it because, sod's law, the Max-Q OEM will be the very last one they get :/

(just noticed the dates have all moved again to mid/late July now!)

@DocShotgun (Contributor) commented:
Following up, I've spent about an hour testing on a hosted Q6_K endpoint, and thus far I haven't had any of those random Chinese character/nonsense word moments, even on prompts where the Q4_0 was triggering them fairly frequently.

@Noeda (Contributor, Author) commented Jun 13, 2025

@jukofyork @DocShotgun have your tests been purely on NVIDIA GPUs? I didn't notice anything weird earlier in my own tests (also mostly Q4), so I'm wondering if it's another platform-specific thing (the GLM-4 family had some tricky issues that seemed to be platform-specific). I ran my earlier testing on Metal.

I'll also test on pure CPU; I forgot I have a 256 GB Hetzner server with a nice modern AMD EPYC CPU. I'll address the review feedback and do further testing on that machine, likely at some point this weekend. I'll also try to replicate that weird token scenario.

@Noeda Noeda force-pushed the dots1_squished_squashy branch 2 times, most recently from b6d1cb8 to 14fa155 Compare June 14, 2025 09:19
@Noeda (Contributor, Author) commented Jun 14, 2025

Addressed all the review feedback given so far.

Also:

  • Taught llama-model.cpp about LLM_TYPE_142B (it shows up in the metadata output when you run llama.cpp, but I'm not sure it has other significance).

Review feedback helped make convert_hf_to_gguf.py a bit simpler. I tested the changes end-to-end, except for the very last push, which only had a small lint fix for the Python code and removed a likely unnecessary piece of code from constants.py.

I'm going to run some basic perplexity checks and HF comparison tests and report back.

The changes made to address review feedback should not affect existing .ggufs in any way; if they do, I probably messed something up.

@Noeda (Contributor, Author) commented Jun 14, 2025

Tested the latest push end-to-end as well (end-to-end here meaning: convert from HF .safetensors -> bf16 .gguf -> q4 .gguf -> load it up in llama-server or llama-cli and prompt it), and it all works.

Perplexity looks normal (it takes a while to run, but the initial numbers look fine to me so far):

perplexity: tokenizing the input ..
perplexity: tokenization took 540.975 ms
perplexity: calculating perplexity over 584 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 40.81 seconds per pass - ETA 1 hours 39.30 minutes
[1]2.9578,[2]3.9821,[3]3.5785,[4]4.2661,[5]4.6334,[6]4.9597,[7]5.3189,[8]5.4915,[9]5.0049,[10]4.5051,[11]4.1832,[12]4.2304,[13]4.6438,[14]4.5120,[15]4.5578,[16]4.6986,[17]4.5273,[18]4.6576,[19]4.6883,[20]4.7281,[21]4.6465,[22]4.7138,[23]4.5798,[24]4.4258,

Running on pure CPU (AMD EPYC 9454P 48-core processor according to /proc/cpuinfo). The above output uses Q4_K and wikitext-2; I'm testing with the instruction model (dots.llm1.inst).

Edit: final result for Q4_K: Final estimate: PPL = 6.3931 +/- 0.04500.

Based on my HF testing, where I instead compared logits between the two implementations to verify the llama.cpp computation graph, I suspect Q8 would get a considerably lower PPL score here.
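For anyone reading the per-chunk numbers above: they are running estimates of exp(mean negative log-likelihood) over the tokens evaluated so far. A minimal sketch of that definition (llama.cpp's perplexity tool computes essentially this over fixed-size chunks):

```python
# PPL = exp(-(1/N) * sum(log p(token_i | context))), using natural logs.
import math


def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities of each ground-truth token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


print(round(perplexity([-1.2, -0.7, -2.3, -0.4]), 2))  # 3.16
```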

@Noeda (Contributor, Author) commented Jun 14, 2025

I can't get the weird "smirpsilon" test case to show up, at least on initial attempts:

Next token probabilities for prompt: " smir" (tokenization: " sm" -> 1525, "ir" -> 404): [screenshot]
Next token probabilities for prompt: "smir" (tokenization: "sm" -> 3563, "ir" -> 404): [screenshot]

Although in these cases that is all I have for the prompt, just two tokens. I haven't seen random Chinese characters either; the quant is the same Q4_K I'm using in the perplexity test above.

@Noeda (Contributor, Author) commented Jun 14, 2025

Did some spot checks of the HF implementation vs the llama.cpp implementation. The CPU is not exactly fast and even a single comparison takes a while, so a comprehensive test is hard; I settled for spot checking.

Empirically, Q8 is a lot closer to the HF implementation than Q4. The HF implementation, following RedNote's example code on their Hugging Face page, outputs raw logits as bfloat16 values. For example, in one test I got these raw logits out of the HF implementation (top 5 tokens):

token_id | raw logit
---------+-----------
279        54.25
17500      54.75
7682       56.5
1198       56.75
73774      57.0

I'm not sure whether that's normal or a mistake. I feel that bfloat16 probably should not be used for the output (and probably not float16 either), given the large number of logits versus the small number of distinct values these types can express. My machine can't load the model at float32 to do a better comparison against llama.cpp, but even with bfloat16 I don't see anything that obviously says the llama.cpp implementation is off; token probabilities generally agree. (cc @redmoe-moutain: is the implementation at https://huggingface.co/rednote-hilab/dots.llm1.inst supposed to output bfloat16 in the example code? I was doing the comparisons based on the first implementation there, "Text Completion", using code from huggingface/transformers#38143.)
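To illustrate the precision concern (an illustration only; the logits above come straight from the HF example code): bfloat16 keeps roughly 8 significant bits, so around 56 its representable values are 0.25 apart, which matches the coarse-looking grid in the table above (54.25, 54.75, 56.5, 56.75, 57.0).

```python
# bfloat16 snaps mid-50s logits to a grid with 0.25 spacing, so distinct
# float32 logits can collapse onto the same value.
import torch

logits = torch.tensor([56.31, 56.47, 56.62, 56.81])
print(logits.to(torch.bfloat16))  # 56.47 and 56.62 both round to 56.50
```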

@Noeda (Contributor, Author) commented Jun 14, 2025

I think I'm done with the testing and the itches I've had so far. I can always test more, but I'm more confident than not that the graphs are correct 👍

I can address any other review comments :) Let me know if anyone sees anything off or a test result looks weird.

@DocShotgun (Contributor) commented:
> I can't get the weird "smirpsilon" test case to show up, at least on initial attempts: [...] I haven't seen random Chinese characters either; the quant is the same Q4_K I'm using in the perplexity test above.

Just a thing to note: the random gibberish tokens I ran into occurred on Q4_0, which is a significantly worse quant than Q4_K. I didn't have any issues when my friend hosted Q6_K.

llama-model : add dots.llm1 architecture support (#14044)

Adds:

* Dots1Model to convert_hf_to_gguf.py

* Computation graph code to llama-model.cpp

* Chat template to llama-chat.cpp to detect this model's template.

---

The architecture is called "dots.llm1" (I decided to generally shorten
it to dots1 or DOTS1 in the code).

The only models that exist as of the writing of this commit that follow
this architecture are "dots.llm1.inst" and "dots.llm1.base", from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
@Noeda Noeda force-pushed the dots1_squished_squashy branch from 14fa155 to df0d4c3 Compare June 15, 2025 06:32
@CISC CISC merged commit 9ae4143 into ggml-org:master Jun 15, 2025
50 checks passed
@CISC CISC linked an issue Jun 15, 2025 that may be closed by this pull request
Labels: python (python script changes)

Successfully merging this pull request may close these issues: Feature Request: dots.llm1 model support