Setup default data for new instance #6077

Merged
Nutomic merged 18 commits into main from instance-setup-data
Oct 27, 2025

Conversation

Member

@Nutomic Nutomic commented Oct 14, 2025

  • Fetch a list of communities from lemmy.ml and embed it in the binary
  • On initial site setup, fetch all these communities along with their most recent posts
  • In debug mode, only load https://lemmy.ml/c/lemmy to reduce CPU/network usage while still making sure it works
  • Also create a default community and a sticky post with getting-started info (links to docs, Matrix, support, etc.)
  • On the first day after a new instance is created, fetch less data over federation (to reduce server load from fetching so many communities)

let community_ids = if env::var("OUT_DIR").unwrap() == "release" {
// fetch list of communities from lemmyverse.net
let mut communities: Vec<CommunityInfo> =
reqwest::blocking::get("https://data.lemmyverse.net/data/community.full.json")?.json()?;
Member

@dessalines dessalines Oct 14, 2025

Let's please use your crawler, and not this service we have no control over. It only needs to be extended slightly to add communities.

Let's also not fetch this data within Lemmy, but mount it as a git submodule. We could even transform the output JSON into its proper Lemmy Rust types with serde, spitting out an .rs file, as a pre-compile task. But it'd also probably be fine to transform the JSON in code, as long as it's only done on startup.

Member Author

The way build.rs works is that it runs during every Lemmy build. In release mode it fetches communities from lemmyverse.net; in debug mode it uses the hardcoded https://lemmy.ml/c/lemmy. It then writes the list to OUT_DIR, which gets included in the final binary. While Lemmy is running, it loads the same file back from inside the binary. So there is no need to generate a .rs file; it would be more complicated for no reason. Also, if the request to lemmyverse.net fails during a release build, the entire build fails, so we can investigate and fix the problem.
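The build-script flow described in this comment can be sketched with only the standard library. This is a rough illustration under stated assumptions, not the code from the PR: the function names `choose_communities` and `write_community_list` and the hand-rolled JSON writer are hypothetical, and the `fetched` argument stands in for the HTTP request that the real build script makes in release mode.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Debug builds use one hardcoded community; release builds use the fetched list.
/// `fetched` is a stand-in for the result of the HTTP request (hypothetical).
fn choose_communities(release: bool, fetched: Vec<String>) -> Vec<String> {
    if release {
        fetched
    } else {
        vec!["https://lemmy.ml/c/lemmy".to_string()]
    }
}

/// Write the URLs as a JSON array to `<out_dir>/communities.json`.
/// Assumes the URLs contain no characters that need JSON escaping;
/// a real build script would more likely serialize with serde.
fn write_community_list(out_dir: &Path, urls: &[String]) -> io::Result<PathBuf> {
    let items: Vec<String> = urls.iter().map(|u| format!("\"{}\"", u)).collect();
    let path = out_dir.join("communities.json");
    fs::write(&path, format!("[{}]", items.join(",")))?;
    Ok(path)
}
```

In a build script, `out_dir` would come from the `OUT_DIR` environment variable that Cargo sets, and the crate can then embed the file with `include_str!(concat!(env!("OUT_DIR"), "/communities.json"))`, as quoted later in this thread.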

Adding support for community crawling to the existing crawler would require a lot of work, and it would make the crawler use a lot more server resources. I don't have time to implement that, and it doesn't make sense if we can simply take the data from an existing open-source crawler made for this purpose. If anything, I would consider mirroring the data from lemmyverse.net to join-lemmy.org or including it in the repo. But that would be more complicated and would require some kind of update task. So as long as there are no actual problems, we should keep it simple like this.

Member Author

@Nutomic Nutomic Oct 16, 2025

Actually we don't need lemmyverse.net or any crawler for this; instead we can simply fetch the community list directly from lemmy.ml. Will change it like that shortly. In theory we could even do the same thing for the instance list on join-lemmy.org...

Edit: Done

@Nutomic Nutomic force-pushed the instance-setup-data branch from 0603782 to 0a7df45 Compare October 15, 2025 10:07
@Nutomic Nutomic marked this pull request as ready for review October 15, 2025 10:55
@Nutomic Nutomic force-pushed the instance-setup-data branch from c88ad77 to 7405cc9 Compare October 15, 2025 12:10
@Nutomic Nutomic force-pushed the instance-setup-data branch from b904421 to 92e2390 Compare October 15, 2025 13:02
Comment on lines +92 to +93
# Set this to true to start with an empty instance instead.
no_default_data: true
Member

I think it'd be better to name this more explicitly: fill_default_federated_communities.

It should probably also be separated from create_welcome_post.

Member Author

The welcome post directly references the auto-fetched communities. So with separate config variables it would also need a different text in that case. Not worth the effort, but I can change the variable name or expand the comment if needed.

Member

IMO this should be

  • Extracted into its own git repo, that uses lemmy-client-rs to fetch and write this communities.json file.
  • We can run this whenever we want, like before a release.
  • Add this repo as a submodule, and put its git submodule path somewhere inside the crate that needs it. Alternatively you could define the communities.json location as an env var, and include it as a mounted file in the docker-compose.yml, so it can be read.
  • Read that file when creating the rows.
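The env-var variant of this proposal might look roughly like the sketch below. It is only an illustration: the variable name `LEMMY_COMMUNITIES_FILE`, the default path, and both function names are assumptions that do not appear in the thread, and parsing the JSON into Lemmy types is deliberately left out.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical variable name; the thread does not specify one.
const COMMUNITIES_FILE_VAR: &str = "LEMMY_COMMUNITIES_FILE";

/// Resolve the file location: the env var value if set, else a default path.
fn communities_path(env_value: Option<String>) -> PathBuf {
    env_value
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("communities.json"))
}

/// Read the raw JSON once at startup, e.g. from a file mounted via docker-compose.
fn load_communities_json() -> io::Result<String> {
    let path = communities_path(env::var(COMMUNITIES_FILE_VAR).ok());
    fs::read_to_string(path)
}
```

Unlike the build-time embedding, this reads the list at runtime, so the list can be updated without rebuilding, at the cost of one more file to ship and keep current.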

Member Author

I don't see what the benefit of all this extra complexity would be. If it requires another task to update, then we will forget about that over time and the list will get outdated. Right now it simply works, and if there is a problem we will notice it and will have time to fix it.

Member

Alright... the main thing that scares me is that it's relying on lemmy.ml being available during the build.

Member Author

That should usually be the case; if not, we can simply restart the build. Or develop a different solution if it really turns out to be necessary.

Comment on lines +664 to +668
// Fetch communities themselves
let tasks = communities.iter().map(|c| async {
let context = context.reset_request_count();
c.dereference(&context).await.ok();
});
Member

Is this actually HTTP fetching data from maybe hundreds of communities?

Really we should only be creating instance and community rows locally, as this issue is about community discovery. I.e., instead of hammering these communities with fetches, we should be running Instance::read_or_create and Community::create with the supplied data.
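The local-insert alternative suggested here can be illustrated with an in-memory stand-in for the two tables. Everything in this sketch is hypothetical: the `Db` type, its fields, and the method signatures are not Lemmy's real schema or the real signatures of Instance::read_or_create and Community::create, which the thread names but does not show.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the instance and community tables (not Lemmy's schema).
#[derive(Default)]
struct Db {
    /// domain -> instance id
    instances: HashMap<String, u32>,
    /// (instance id, community name)
    communities: Vec<(u32, String)>,
}

impl Db {
    /// Return the id for `domain`, inserting a new row if it is not known yet.
    fn read_or_create_instance(&mut self, domain: &str) -> u32 {
        let next_id = self.instances.len() as u32 + 1;
        *self.instances.entry(domain.to_string()).or_insert(next_id)
    }

    /// Insert a community row from data already at hand; no HTTP request involved.
    fn create_community(&mut self, domain: &str, name: &str) {
        let instance_id = self.read_or_create_instance(domain);
        self.communities.push((instance_id, name.to_string()));
    }
}
```

The point of the sketch is only that communities on the same domain share one instance row and that nothing here touches the network; it would not populate posts, which the author's replies give as the reason for fetching.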

Member Author

50 communities in total, along with recent posts so that the All tab gets populated on a new instance. I added a check, is_new_instance(), which reduces the amount of data fetched so that it doesn't take so long.
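How such a check could work is sketched below. The real is_new_instance() is not shown in this thread, so this is a guess at its shape: the 24-hour threshold is taken from the "first day" wording in the PR description, and the `fetch_limit` helper with its parameters is purely hypothetical.

```rust
use std::time::{Duration, SystemTime};

/// Treat an instance as new during the first 24 hours after its creation.
fn is_new_instance(created: SystemTime, now: SystemTime) -> bool {
    match now.duration_since(created) {
        Ok(age) => age < Duration::from_secs(24 * 60 * 60),
        // Creation time lies in the future (clock skew): treat the instance as new.
        Err(_) => true,
    }
}

/// Pick how many items to fetch per community; both limits are supplied by the
/// caller because the thread does not state the actual numbers.
fn fetch_limit(new_instance: bool, reduced: usize, normal: usize) -> usize {
    if new_instance {
        reduced
    } else {
        normal
    }
}
```

Passing `now` in explicitly, instead of calling `SystemTime::now()` inside the function, keeps the check deterministic and easy to test.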

Member

What's the purpose of those HTTP fetches? Don't we only need to fill the community table rows so that the communities are searchable? In that case we only need to do DB inserts.

Member Author

Not sure what you mean; we still need to get the data to insert from somewhere. Are you talking about embedding all that in the binary? That would be a lot more complicated to implement. And this doesn't only insert communities, but also the recent posts, to have some initial content.

Member

let communities_json = include_str!(concat!(env!("OUT_DIR"), "/communities.json"));
let communities: Vec<ObjectId<ApubCommunity>> = serde_json::from_str(communities_json)?;

You're reading a communities.json file that was already produced by a ListCommunities call and has all the info necessary to fill community DB rows. There's no reason to then fetch data; you already have it.

Fetching initial content for 1000s of communities could be burdensome on a lot of servers. It's probably not a good idea, especially since this issue should only be about getting communities.

Member Author

It only fetches 50 communities (adjusted the text), and only when a new Lemmy server is created, which is not that often. Compared to the normal data fetches done by any active Lemmy instance, this is not much.

I tried to change it to embed a Vec<CommunityView> in the binary instead. But this would require two separate API requests in build.rs to fetch /c/announcements and /c/lemmy, as well as an API request in debug builds to fetch /c/lemmy for testing. It would also be more complicated to fetch recent posts this way, as CommunityView doesn't store the outbox_url. Anyway, the fetching is quite fast and works well.

Member

Alright then, I suppose it's okay.

@dessalines
Member

Merge whenever you like.

@Nutomic Nutomic merged commit 8c2303a into main Oct 27, 2025
2 checks passed
@Nutomic Nutomic deleted the instance-setup-data branch October 27, 2025 09:28
@thethunderwolf

I know that this is old but hardcoding the instance doesn't seem very decentralized

especially concerning given the reputation that lemmy.ml has

@dessalines
Member

Already superseded and made configurable by #6276
