We noticed that you use features from the last 4 layers of the encoder rather than from intermediate layers (e.g. [5, 12, 18, 24] for vitl), which some other works such as DINOv2 use. What is the reason for this choice, and is there any notable difference between these two strategies?
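
To make the comparison concrete, here is a minimal sketch of the two selection strategies as I understand them. It assumes 1-based block indices and a 24-block ViT-L encoder; the helper names (`last_k_layers`, `select_features`) are hypothetical and not taken from either codebase, and the strings stand in for real feature maps.

```python
# Hypothetical helpers for illustration only; not from any repository.
# Indices are treated as 1-based block numbers (an assumption).

def last_k_layers(depth: int, k: int = 4) -> list[int]:
    """Return the 1-based indices of the final k transformer blocks."""
    return list(range(depth - k + 1, depth + 1))

# Intermediate-layer taps quoted above for vitl.
INTERMEDIATE_VITL = [5, 12, 18, 24]

def select_features(per_layer_features: list, indices: list[int]) -> list:
    """Pick the outputs of the given 1-based block indices."""
    return [per_layer_features[i - 1] for i in indices]

depth = 24  # ViT-L has 24 transformer blocks
# Placeholder strings standing in for the per-block feature maps.
layer_outputs = [f"block_{i}" for i in range(1, depth + 1)]

last4 = select_features(layer_outputs, last_k_layers(depth))
spread = select_features(layer_outputs, INTERMEDIATE_VITL)
```

Under these assumptions the first strategy taps only blocks 21 to 24, while the second samples blocks spread across the whole depth of the encoder.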