Conversation
crates/api/api_crud/build.rs
Outdated
| let community_ids = if env::var("OUT_DIR").unwrap() == "release" { | ||
| // fetch list of communities from lemmyverse.net | ||
| let mut communities: Vec<CommunityInfo> = | ||
| reqwest::blocking::get("https://data.lemmyverse.net/data/community.full.json")?.json()?; |
There was a problem hiding this comment.
Lets please use your crawler, and not this service we have no control over. It only needs to be extended slightly to add communities.
Lets also not fetch this data within lemmy, but mount it as a git submodule. We could even transform the output JSON into its proper lemmy rust types with serde, spitting out a an .rs file, as a pre-compile task. But it'd also probably be fine to transform the json in code, as long as its only done on startup.
There was a problem hiding this comment.
The way build.rs works is that it runs during every Lemmy build. In release mode it fetches communities from lemmyverse.net, in debug it uses the hardcoded https://lemmy.ml/c/lemmy. It then writes the list to OUT_DIR which gets included in the final binary. Then while Lemmy is running it loads the same file back from inside the binary. So there is no need to generate a .rs file, it would be more complicated for no reason. Also if the request to lemmyverse.net fails during release build, the entire build fails so we can investigate and fix the problem.
Adding support for community crawling to the existing crawler would require a lot of work, and it would make the crawler use a lot more server resources. I dont have time to implement that, and it doesnt make sense if we can simply take the data from an existing crawler made for this purpose which is also open source. If anything I would consider mirroring the data from lemmyverse.net to join-lemmy.org or including it in the repo. But that would be more complicated and would require some kind of update task. So as long as there are no actual problems we should keep it simple like this.
There was a problem hiding this comment.
Actually we dont need lemmyverse.net or any crawler for this, instead we can simply fetch the community list directly from lemmy.ml. Will change it like that shortly. In theory we could even do the same thing for the instance list on join-lemmy.org...
Edit: Done
0603782 to
0a7df45
Compare
c88ad77 to
7405cc9
Compare
b904421 to
92e2390
Compare
| # Set this to true to start with an empty instance instead. | ||
| no_default_data: true |
There was a problem hiding this comment.
I think it'd be better to name this more explicitly: fill_default_federated_communities.
probably separating it from create_welcome_post.
There was a problem hiding this comment.
The welcome post directly references the auto-fetched communities. So with separate config variables it would also need a different text in that case. Not worth the effort, but I can change the variable name or expand the comment if needed.
There was a problem hiding this comment.
IMO this should be
- Extracted into its own git repo, that uses lemmy-client-rs to fetch and write this
communities.jsonfile. - We can run this whenever we want, like before a release.
- Add this repo as a submodule, and put its git submodule
pathsomewhere inside the crate that needs it. Alternatively you could define thecommunities.jsonlocation as an env var, and include it as a mounted file in thedocker-compose.yml, so it can be read. - Read that file when creating the rows.
There was a problem hiding this comment.
I dont see what would be the benefit of all this extra complexity. If it requires another task to update then we will forget about that over time, and the list will get outdated. Right now it simply works, and if there is a problem we will notice it and will have time to fix it.
There was a problem hiding this comment.
Alright... the main thing that scares me is that it's relying on lemmy.ml being available during the build.
There was a problem hiding this comment.
That should usually be the case, if not we can simply restart the build. Or develop a different solution if it really turns out to be necessary.
| // Fetch communities themselves | ||
| let tasks = communities.iter().map(|c| async { | ||
| let context = context.reset_request_count(); | ||
| c.dereference(&context).await.ok(); | ||
| }); |
There was a problem hiding this comment.
Is this actually HTTP fetching data from maybe hundreds of communities?
Really we should only be creating instance and community rows locally, as this issue is about community discovery. IE instead of hammering these communities with fetches, we should be running Instance::read_or_create and Community::create with the supplied data.
There was a problem hiding this comment.
50 communities in total, along with recent posts so that the All tab gets populated on a new instance. I added a check is_new_instance() which reduces the amount of data fetched, and this way it doesnt take so long.
There was a problem hiding this comment.
What's the purpose of those http fetches? Don't we only need to fill the community table rows so that the communities are searchable? In that case we only need to do DB inserts.
There was a problem hiding this comment.
Not sure what you mean, we still need to get the data to insert from somewhere. Are you talking about embedding all that in the binary? Would be a lot more complicated to implement that way. And this doesnt only insert communities, but also the recent posts to have some initial content.
There was a problem hiding this comment.
let communities_json = include_str!(concat!(env!("OUT_DIR"), "/communities.json"));
let communities: Vec<ObjectId<ApubCommunity>> = serde_json::from_str(communities_json)?;You're reading a communities.json file, which called ListCommunities already, and has all the info necessary to fill community DB rows. There's no reason to then fetch data, you already have it.
Fetching initial content for 1000s of communities could be burdensome on a lot of servers, its probably not a good idea, especially since this issue should only be about getting communities.
There was a problem hiding this comment.
It only fetches 50 communities (adjusted the text), and only when a new Lemmy server is created which is not that often. Compared to the normal data fetches done by any active Lemmy instance this is not much.
I tried to change it to embed a Vec<CommunityView> in the binary instead. But this would require two separate API requests in build.rs to fetch /c/announcements and /c/lemmy, as well as an API request in debug builds to fetch /c/lemmy for testing. It also would be more complicated to fetch recent posts this way as CommunityView doesnt store the outbox_url. Anyway the fetching is quite fast and works well.
There was a problem hiding this comment.
Alright then, I spose its okay.
|
Merge whenever you like. |
|
I know that this is old but hardcoding the instance doesn't seem very decentralized especially concerning given the reputation that lemmy.ml has |
|
Already superceded and made configurable by #6276 |
https://lemmy.ml/c/lemmyto reduce cpu/network usage while making sure it works