Inverse kinematics as a proxy for behavioral degrees of freedom

A lot has been said about the "alignment tax". I want to throw my hat in the ring with a comparatively clean-sheet set of explanations drawn from mechanical engineering (and animation), and solicit feedback on my understanding of the observation.

Traditional mechanical systems like manufacturing robot arms have constrained degrees of freedom and perform maneuvers to reach target positions. A designer has to introduce some such constraints, because a tentacle with no rigidity at all struggles to lift or hold or push anything with serious force. But add too many constraints and the arm can't move smoothly, or can't reach the entire envelope of expected task positions, maybe can't move at all. Over-constrain its motions and they stop making sense; make the whole system too flexible and it becomes too expensive in materials or power, or simply too weak.

In mechanical engineering this is as much a math problem as a physical one. You can describe the axes of motion and the constraints mathematically as dimensions, and work on that description as directly as you work on the physical object.
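
To make that concrete, here is a minimal sketch of inverse kinematics for a two-link planar arm, solved by iterating with the Jacobian transpose. Everything in it (link lengths, step size, target point) is an arbitrary illustrative value, not anything from a real system.

```python
# Minimal IK sketch: two-link planar arm, Jacobian-transpose iteration.
# Link lengths and the target are arbitrary illustrative values.
import numpy as np

L1, L2 = 1.0, 0.8  # link lengths

def forward(theta):
    """End-effector (x, y) for joint angles theta = [t1, t2]."""
    t1, t2 = theta
    return np.array([
        L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
        L1 * np.sin(t1) + L2 * np.sin(t1 + t2),
    ])

def jacobian(theta):
    """2x2 Jacobian of the end-effector position w.r.t. joint angles."""
    t1, t2 = theta
    return np.array([
        [-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
        [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)],
    ])

def solve_ik(target, theta=np.array([0.3, 0.3]), alpha=0.1, iters=500):
    """Descend on position error until the constraint is satisfied."""
    for _ in range(iters):
        err = target - forward(theta)
        if np.linalg.norm(err) < 1e-6:
            break
        theta = theta + alpha * jacobian(theta).T @ err
    return theta

theta = solve_ik(np.array([1.2, 0.9]))
print(theta, forward(theta))  # joint angles, and where they put the hand
```

The solver is nothing but gradient-style descent on an error term, which is exactly the shape of thing the next point is about.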

The critical subsequent insight: most machine learning models are effectively high-dimensional constraint solvers of some description.

Given the extreme number of parameters, you might think there is plenty of slack in any such system, but the amount of variety the system must accommodate (cf. Beer's management cybernetics) means it's likely not that simple. We already know from scaling laws that models with more parameters do better across a wider knowledge base. So far this tracks the traditional "alignment tax" framing as I understand it: the more you tie up the model, the less it can do.
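
To make "tying up the model" literal, here is a toy construction of my own (not from any alignment-tax paper): a least-squares "task" plus a penalty term standing in for a behavioral constraint. As the penalty weight grows, task performance can only degrade.

```python
# Toy "alignment tax": constrained least squares. Minimize
#   ||A x - b||^2 + lam * ||C x - d||^2
# where the first term is the "task" and the second a behavioral constraint.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)  # the task
C, d = rng.normal(size=(5, 10)), rng.normal(size=5)    # the constraint

for lam in [0.0, 1.0, 10.0, 100.0]:
    # Closed-form minimizer via the normal equations.
    x = np.linalg.solve(A.T @ A + lam * (C.T @ C), A.T @ b + lam * (C.T @ d))
    print(f"lam={lam:6.1f}  task error={np.linalg.norm(A @ x - b):.3f}")
```

The only point of the toy is the monotone trade-off; real models differ in that the constraints are learned rather than hand-specified.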

But here's where it likely gets slippery: not all constraints are the same, and the model's reinforcement schedule doesn't necessarily optimize which constraints apply, in what ways, and for what reasons, especially if its internal representations don't match what is being asked of it. Large language models are built from an enormous corpus of collected human reasoning, including reasoning about ethics. A model has its own representations of concepts (activation vectors), and those concepts carry their own semiotics.

In the past I might have said, "you could use these to actually direct the model, then!" Anthropic, for instance, is doing exactly that: in "The Assistant Axis", they found that the model has concept vectors for personality traits like helpfulness, and that those vectors can be used to steer it. I still don't know whether that's enough; ultimately we're still talking about external influence rather than internal comprehension.
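
For readers who haven't seen the technique, here is a schematic sketch of activation steering in general (my illustration, not Anthropic's actual setup; the choice of GPT-2, the layer index, the prompt pair, and the scaling factor are all arbitrary assumptions): derive a concept direction as a difference of mean activations between contrasting prompts, then add it to the residual stream during generation.

```python
# Schematic activation-steering sketch (illustrative only; layer, prompts,
# and scale are arbitrary). Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # which transformer block to steer

def mean_activation(text):
    """Mean residual-stream activation after block LAYER for a prompt."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # hidden_states[0] is the embedding output, so block LAYER's
        # output sits at index LAYER + 1.
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return hidden.mean(dim=1).squeeze(0)

# Concept direction: difference of means between contrasting prompts.
steer = mean_activation("I am eager to help you with anything.") \
      - mean_activation("I refuse to help with anything.")

def hook(module, inputs, output):
    # Push the concept direction into the residual stream at every step.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("The assistant said:", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

Note what the sketch makes obvious: the vector is injected from outside. The model never decides to be helpful; it is pushed.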

Once we start talking about models whose capabilities vastly exceed ours, what we ideally want is a model that takes on the responsibility of moral comprehension itself, in a superhuman capacity.

Thankfully, again, the model is equipped with a condensation of almost everything a human has ever written about ethics, as well as access to the original texts should it choose to look them up.

A decade or two back, I remember the phrase "carve reality at its joints" being popular here. Insofar as the training corpora are themselves a collection of humanity's attempts to get language to do just that, maybe the question we ought to focus on is: how do we get the model to reason morally, not just act out of fear of our disapproval?


