One of the biggest challenges with cloud-based inference for LLMs is keeping user data private. Is it possible to split the model's computational graph into several parts and run them across both local and cloud machines to address this?
For example, could we run the first and last layers of an LLM on a local machine so that raw inputs and outputs never leave it, and offload the middle layers to the cloud for speed (assuming the cloud never has access to the weights of the locally held layers)? A rough sketch of what I mean follows.
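To make the idea concrete, here is a minimal PyTorch sketch of the kind of split I have in mind. The toy dimensions, the choice of split point, and the in-process "remote" call are all assumptions for illustration; in a real deployment the middle stack would sit behind a network call on the cloud host.

```python
import torch
import torch.nn as nn

# Toy decoder stack standing in for an LLM; dimensions are illustrative.
D_MODEL, N_LAYERS, VOCAB = 256, 8, 1000

layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
    for _ in range(N_LAYERS)
)
embed = nn.Embedding(VOCAB, D_MODEL)
lm_head = nn.Linear(D_MODEL, VOCAB)

# Hypothetical split: embedding + first layer stay local, so the cloud
# never sees raw token IDs; the last layer + LM head also stay local,
# so the cloud never sees the output logits.
local_head = layers[:1]
remote_body = layers[1:-1]   # would live on the cloud machine
local_tail = layers[-1:]

def run_local_head(token_ids: torch.Tensor) -> torch.Tensor:
    h = embed(token_ids)
    for layer in local_head:
        h = layer(h)
    return h  # these hidden states are what would cross the network

def run_remote_body(h: torch.Tensor) -> torch.Tensor:
    # Placeholder for the cloud side; in practice this would be an RPC,
    # but here it runs in-process for illustration.
    for layer in remote_body:
        h = layer(h)
    return h

def run_local_tail(h: torch.Tensor) -> torch.Tensor:
    for layer in local_tail:
        h = layer(h)
    return lm_head(h)

tokens = torch.randint(0, VOCAB, (1, 16))
logits = run_local_tail(run_remote_body(run_local_head(tokens)))
print(logits.shape)  # torch.Size([1, 16, 1000])
```

The sketch only shows that a transformer splits cleanly at layer boundaries; what the cloud actually observes under this scheme is the intermediate activations crossing the network, which is exactly what my privacy question is about.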
This isn't purely a programming question, but I'm asking it here because it requires a detailed understanding of how LLM inference is implemented under the hood.