Yes, a mono-band sensor would be better for sure. (Thanks for including the figure for clarity for others!) The reality with the stacked sensor, as I understand it, is that the response functions of the different silicon layers overlap, so there really isn’t a red, green, and blue layer per se: the filtering comes from how deeply each wavelength penetrates the silicon. One can still think of the 3 layers in a Foveon as loosely related to those colors, with a complicated and computationally intensive process needed to transform that info into a color space that we understand.
Side note: this complicated process of deconvolution (?) of the data is the reason why Foveon chips tend to be slow in post-processing and noise-prone at higher ISO levels. In some ways they have the opposite problems from those introduced by the convolution / interpolation that is used to resolve sampling in Bayer-style sensors.
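To make the noise point a bit more concrete, here is a minimal sketch of the idea, assuming the layer-to-color transform can be approximated by a single 3x3 matrix. That is a simplification on my part: the real Foveon pipeline is proprietary and surely more involved. The coefficients in `UNMIX` and the function names are invented purely for illustration. The point is only that untangling overlapping layer responses needs large positive and negative weights, and those weights amplify whatever noise is in the layers.

```python
import math

# HYPOTHETICAL unmixing matrix (made-up numbers, not real Foveon data).
# Each row gives one output channel (R, G, B) as a weighted sum of the
# (top, middle, bottom) layer signals. Rows sum to 1 so that equal layer
# signals map to a neutral gray.
UNMIX = [
    [-0.5, -1.0, 2.5],   # R
    [-1.0, 3.0, -1.0],   # G
    [2.5, -1.0, -0.5],   # B
]


def layers_to_rgb(layers):
    """Apply the assumed linear transform to one pixel's three layer signals."""
    return [sum(w * s for w, s in zip(row, layers)) for row in UNMIX]


def noise_gain(row):
    """Noise amplification for one output channel, assuming the three layers
    carry independent noise of equal standard deviation."""
    return math.sqrt(sum(w * w for w in row))


if __name__ == "__main__":
    print(layers_to_rgb([1.0, 1.0, 1.0]))
    print([round(noise_gain(row), 2) for row in UNMIX])
```

With these made-up weights every color channel ends up with more than twice the noise of a single layer, whereas a row like `[1, 0, 0]` (no unmixing needed) would have a gain of exactly 1.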
So all that to say, when you switch a Foveon chip to grayscale mode, it produces a single integrated brightness value per pixel, which strikes me as conveniently similar to what a mono-band sensor does.
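As a rough sketch of what I mean by "integrated", assuming grayscale mode amounts to a plain unweighted sum of the three layer signals at each pixel (I don't know whether the camera or raw converter actually weights the layers differently), it would look something like this. The function name and the sample numbers are hypothetical.

```python
def to_mono(layer_stack):
    """Collapse each pixel's (top, middle, bottom) layer signals into one
    brightness value by summing them. No color unmixing is involved.

    layer_stack: list of rows, each row a list of 3-tuples of layer signals.
    """
    return [[sum(pixel) for pixel in row] for row in layer_stack]


if __name__ == "__main__":
    # Made-up 2x2 image of raw layer counts.
    image = [
        [(10, 20, 30), (5, 5, 5)],
        [(0, 0, 0), (100, 50, 25)],
    ]
    print(to_mono(image))
```

Because nothing has to be untangled into separate colors here, this path skips the heavy transform entirely, which is why the output looks so much like what a mono-band sensor hands you directly.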