You’re right: every image has its own local copy of all variables, whether or not they’re declared as coarrays. Allocatable arrays that are not coarrays don’t have to be allocated on all images, and they don’t need to have the same extents (lower and upper bounds) everywhere. Coarrays, on the other hand, must be allocated with the same lower and upper bounds on all images, and on all images at the same time (allocating a coarray synchronizes all images, so it’s a blocking operation).
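To make that concrete, here’s a minimal sketch of the difference (variable names are mine, not from any library):

```fortran
program alloc_demo
  implicit none
  real, allocatable :: local_buf(:)     ! ordinary allocatable: per-image only
  real, allocatable :: weights(:)[:]    ! allocatable coarray

  ! An ordinary allocatable array may be allocated on some images only,
  ! and with different extents on each image.
  if (this_image() == 1) allocate(local_buf(1000))
  if (this_image() == 2) allocate(local_buf(10))

  ! An allocatable coarray must be allocated on every image with the
  ! same bounds; the allocate statement synchronizes all images.
  allocate(weights(100)[*])

  deallocate(weights)   ! deallocating a coarray also synchronizes all images
end program alloc_demo
```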
Back to neural-fortran: I looked at how it’s implemented, and it seems that every image keeps a copy of the full input arrays x and y. Each image trains only on its own portion of the workload, after which the weights and biases are updated globally using co_sum. This is of course not optimal for memory use, but it allows a very simple high-level API that uses exactly the same code for serial and parallel execution. In other words, the parallel work distribution is managed inside network_type % train_batch(), not outside of it.
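Schematically, the co_sum idiom looks like this. It’s a simplified sketch of the pattern, not the actual neural-fortran code; the names w, dw, and eta are illustrative:

```fortran
program cosum_sketch
  implicit none
  real :: w(10), dw(10)
  real, parameter :: eta = 0.01

  w = 0.0
  ! Each image computes a gradient from its own portion of the batch;
  ! this constant is just a stand-in for that local computation.
  dw = real(this_image())

  ! co_sum is a collective: after the call, every image holds the sum
  ! of dw over all images.
  call co_sum(dw)

  ! Average and apply the same update on every image, so the weights
  ! and biases stay identical everywhere without explicit copies.
  w = w - eta * dw / num_images()
end program cosum_sketch
```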
A more memory-efficient approach would be for each image to read only its own portion of the data and to call network_type % fwdprop(), network_type % backprop(), and network_type % update() directly. You’d need to take care of the data exchange yourself, but it’s simple enough; you can just follow how it’s done in network_type % train_batch().
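Here’s a rough sketch of the data decomposition and exchange you’d then manage yourself. The sample count and gradient array are placeholders, and the calls to fwdprop/backprop/update are left as comments because their exact interfaces are best taken from the neural-fortran source:

```fortran
program decomposition_sketch
  implicit none
  integer, parameter :: nsamples = 60000   ! placeholder for the total sample count
  integer :: chunk, first, last
  real, allocatable :: grad(:)

  ! Block-distribute the sample indices over the images.
  chunk = nsamples / num_images()
  first = (this_image() - 1) * chunk + 1
  last  = this_image() * chunk
  if (this_image() == num_images()) last = nsamples   ! last image takes any remainder

  ! ... read only samples first..last on this image, call fwdprop and
  !     backprop over them, and accumulate the local gradient in grad ...
  allocate(grad(100))
  grad = 0.0

  ! The exchange step you'd do yourself: sum the local gradients across
  ! images and average, then pass the result to update on every image.
  call co_sum(grad)
  grad = grad / num_images()
end program decomposition_sketch
```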