About a month ago, Marco Kuhlmann asked on twitter whether the universal recognition problem for non-permuting LCFRSs is PSPACE-complete. (Non-permuting LCFRSs were called monotone LCFRSs by Michaelis and by Kracht. They do not allow arguments of the same nonterminal on the right-hand side of a rule to appear in reversed order on the left-hand side.) Good question. Kaji et al. (1992) proved that the universal recognition for arbitrary LCFRSs is PSPACE-complete by reduction from an arbitrary language in PSPACE. The LCFRS that is the output of the reduction is not non-permuting, and converting a given LCFRS to a non-permuting normal form may result in an exponential blowup in size, so this result by itself does not answer Marco's question. I couldn't give an answer immediately because I was not familiar with the details of Kaji et al.'s proof. I found the proof not so easy to read, but, after spending a few days trying to understand it, I found that it is easy to modify their construction to make the output always non-permuting. So the answer to Marco's question is, yes, it is PSPACE-complete.
Kaji et al.'s construction is designed to simulate an accepting computation of a polynomial-space-bounded Turing machine by a derivation of an LCFRS of rank 1 and dimension (or fan-out or arity) proportional to the the number of tape cells used in the computation. Although this construction works for arbitrary polynomial-space-bounded Turing machines, the reduction is from a fixed language \(L\) in PSPACE to the universal recognition problem for general LCFRSs, i.e., the reduction transforms a string \(w\) to a pair \((G_w,z_w)\) consisting of an LCFRS and a string such that \(w \in L\) if and only if \(z_w \in L(G_w)\). So you work with a fixed Turing machine \(M\) which accepts \(L\) and operates in space \(p(n)\), a fixed polynomial function of \(n\). The important point is that the function \(p(n)\) doesn't vary from input to input. (Otherwise the reduction can't be carried out in polynomial time.)
Every nonterminal \(A\) of \(G_w\) except the start nonterminal \(S\) has the same dimension \(d\), which is a constant multiple of \(p(|w|)\), the space used in a possible accepting computation on input \(w\). A tuple that is a possible argument of nonterminal \(A\) consists of two parts—most of the components of the tuple represent the contents of \(M\)'s tape, and the remaining components (of which there is just one in Kaji et al.'s original construction) act as a kind of "trash can". Each cell of \(M\)'s tape is represented by a tuple consisting of \(\epsilon\)s and 1s. Although there are more efficient coding, it suffices to take a tuple of the form \[ \epsilon,\dots,\epsilon,1,\epsilon,\dots,\epsilon \] whose length \(t\) equals the size of the tape alphabet of \(M\). The position of the unique occurrence of \(1\) tells what symbol is occupying the cell. The head position and the state of \(M\) can be coded in nonterminals, so that a "derived fact" \[ A_{q,i}(\ulcorner c_1 \urcorner,\dots,\ulcorner c_{p(n)} \urcorner,\epsilon) \] represents a configuration of \(M\) in which \(M\) is in state \(q\), its head is at the \(i\)th cell from the left, and the tape content is \(c_1 \dots c_{p(n)}\). Here, \(\ulcorner c \urcorner\) is the above tuple representation of the symbol \(c\), and the final argument of \(A_{q,i}\) is the trash can, which is always required to contain \(\epsilon\).
A single step in \(M\)'s computation on \(w = c_1 \dots c_n\) is accounted for by a single rule of \(G_w\). Let us assume for simplicity that a single step of \(M\) either just rewrites the symbol under the head (without moving) or just moves the head one cell to the left or right (without writing). (Remember that \(M\) is not part of the input to reduction, so we can take any polynomial-space-bounded Turing machine we like that accepts \(L\).) The first type of move corresponds to a rule of \(G_w\) of the form \begin{equation} \label{rule1} A_{r,i}(\vec{x_1},\dots,\vec{x_{i-1}},\underbrace{\epsilon,\dots,\epsilon}_{l-1},x_{i,k},\underbrace{\epsilon,\dots,\epsilon}_{t-l},\vec{x_{i+1}},\dots,\vec{x_{p(n)}},x_{i,1} \dots x_{i,k-1}x_{i,k+1}x_{i,t}y) \leftarrow A_{q,i}(\vec{x_1},\dots,\vec{x_{p(n)}},y), \end{equation} and the second type of move corresponds to a rule of the form \begin{equation} \label{rule2} A_{r,i \pm 1}(\vec{x_1},\dots,\vec{x_{p(n)}},y) \leftarrow A_{q,i}(\vec{x_1},\dots,\vec{x_{p(n)}},y). \qquad\qquad\mbox{} \end{equation} Here, \(\vec{x_j}\) stands for a tuple of variables \(x_{j,1},\dots,x_{j,t}\) of length \(t\). The change from \[ x_{i,1},\dots,x_{i,t} \] to \[ \underbrace{\epsilon,\dots,\epsilon}_{l-1},x_{i,k},\underbrace{\epsilon,\dots,\epsilon}_{t-l} \] in (\ref{rule1}) expresses the rewriting of the symbol represented by \[ \underbrace{\epsilon,\dots,\epsilon}_{k-1},1,\underbrace{\epsilon,\dots,\epsilon}_{t-k} \] with another symbol; it moves the \(k\)th component of the tuple to the \(l\)th component, while moving all the other components to the trash can.
The initial configuration of \(M\) corresponds to the unique terminating rule \begin{equation} \label{rule3} A_{q_I,1}(\ulcorner c_1 \urcorner,\dots,\ulcorner c_n \urcorner,\ulcorner b \urcorner,\dots,\ulcorner b \urcorner, \epsilon) \leftarrow , \end{equation} where \(q_I\) is the initial state of \(M\) and \(b\) stands for the blank symbol. For each final state \(q\) and \(i=1,\dots,p(n)\), there is the rule \begin{equation} \label{rule4} S(x_{1,1}\dots x_{1,t} \dots x_{p(n),1} \dots x_{p(n),t} \# y) \leftarrow A_{q,i}(\vec{x_1},\dots,\vec{x_{p(n)}},y), \;\mbox{} \end{equation} which concatenates all the tuples representing the tape and affixes \(\#\) followed by the content of the trash can. Setting \(z_w = 1^{p(n)} \# \) enforces the requirement that the string in the trash can be always empty.
Now rules of \(G_w\) of type (\ref{rule1}) are not non-permuting, because variables that get moved to the trash can may move over other variables. To avoid this, an easy solution is to place trash cans to the left and to the right of every tuple representing a tape cell: \begin{equation} \tag{$1'$} A_{r,i}(y_0,\vec{x_1},y_1,\dots,\vec{x_{i-1}},y_{i-1} x_{i,1} \dots x_{i,k-1},\underbrace{\epsilon,\dots,\epsilon}_{l-1},x_{i,k},\underbrace{\epsilon,\dots,\epsilon}_{t-l},x_{i,k+1} \dots x_{i,t} y_i,\vec{x_{i+1}},y_{i+1},\dots,\vec{x_{p(n)}},y_{p(n)}) \leftarrow A_{q,i}(y_0,\vec{x_1},y_1,\dots,\vec{x_{p(n)}},y_{p(n)}). \end{equation} The dimension \(d\) of nonterminals (other than \(S\)) has changed, so we need to adjust rules (\ref{rule2}) and (\ref{rule3}) accordingly: \begin{gather} \tag{$2'$} A_{r,i \pm 1}(y_0,\vec{x_1},y_1,\dots,\vec{x_{p(n)}},y_{p(n)}) \leftarrow A_{q,i}(y_0,\vec{x_1},y_1,\dots,\vec{x_{p(n)}},y_{p(n)}), \\ \tag{$3'$} A_{q_I,1}(\epsilon,\ulcorner c_1 \urcorner,\epsilon,\dots,\ulcorner c_n \urcorner,\epsilon,\ulcorner b \urcorner,\epsilon,\dots,\ulcorner b \urcorner, \epsilon) \leftarrow . \end{gather} To force each trash can to be empty, we now let \(z_w = (\# 1 \#)^{p(n)}\) and replace (\ref{rule4}) by \begin{equation} \tag{$4'$} S(y_0 \# x_{1,1} \dots x_{1,t} \# y_1 \# x_{2,1} \dots x_{2,t} \# y_2 \dots x_{p(n),1} \dots x_{p(n),t} \# y_{p(n)}) \leftarrow A_{q,i}(y_0,\vec{x_1},y_1,\dots,\vec{x_{p(n)}},y_{p(n)}). \end{equation}
With these changes, the grammar \(G_w\) is now non-permuting, and since it is of rank 1, it is also well-nested. So this shows that universal recognition for well-nested MCFGs (LCFRSs) is also PSPACE-complete, something I have always claimed without realizing that Kaji et al.'s proof needs to be modified to establish it.