Machine Learning
On November 13, 2025 by Jonathan Zdziarski
A while back, I wrote AI is Just Someone Else’s Intelligence. Since then, many, many people (and when I say many, many people, I mean nobody) have asked me what a legal framework for LLMs might look like, in light of their mathematical composition of training data. I’ve been thinking a lot about this lately, and have some ideas I just wanted to put down somewhere. I’ve run this past my dog, and she seems to think it a bit ruff, but that it could perhaps be a good foundation for some future legislation.
We know that LLMs (and other deep learning systems) are a mathematical prediction of content based on training data. This “digital dust cloud” spat out by an LLM is the reverse of plotting data in n-dimensional space. Instead of labeling a point on a graph, LLMs essentially point at some spot on a graph and compute the most probable token to exist there. This has been the basis for ML since the days of logistic regression.
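To make the "point at a spot and compute the most probable token" idea concrete, here is a minimal sketch in Python. The three-word vocabulary and the scores are made up for illustration; a real model produces scores over tens of thousands of tokens from learned weights, which this toy does not attempt to model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def most_probable_token(vocab, scores):
    """Return the token with the highest probability, plus that probability."""
    probs = softmax(scores)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]

# Hypothetical vocabulary and scores, invented for this example.
vocab = ["cat", "dog", "ruff"]
scores = [0.5, 1.0, 3.0]

token, probability = most_probable_token(vocab, scores)
print(token, round(probability, 3))
```

The same softmax step appears in multinomial logistic regression, which is why the comparison holds: what changes between the two is how the scores are computed, not how they become a prediction.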
Predicated upon this basic principle, a few different classifications can be built.
- Summarization: A machine can be thought of as generating a summarization of ingested works when its output condenses a large amount of training input into a shorter, more concise form. Some properties of a summarization are:
  - Citations are given to acknowledge the primary sources of the input.
  - Fair use rules apply to the work as they would to any human-generated work.
- Reproduction: A machine produces a reproduction when it reconstructs a work resembling prior works from its training set. Reproductions take the unique qualities of a solitary creator and apply them to a new work whose own characteristics materially depend on those qualities. Some properties of reproductions include:
  - The generated work includes unique, key characteristics learned from a unique family of training data.
  - The value of the work is dependent upon these characteristics.
  - Citations are given to acknowledge the solitary creator as the primary source of input.
- Composite: A composite work is an output consisting of multiple unique, key characteristics from multiple families of training data. It is much like a reproduction, but includes the unique qualities of multiple creators. Some properties of composites include:
  - The generated work includes multiple, unique characteristics that can be identified as individual reproductions from each creator (of an original work).
  - The value of the work is dependent upon a combination of these characteristics.
  - Citations are given to acknowledge the creators as primary sources of the input.
- Plagiarism: A work of plagiarism is an output that meets any one of the prior three classifications (summarization, reproduction, or composite), yet does not credit a majority of its content to one or more primary sources.
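The four classifications above can be read as a small decision procedure. The sketch below is only an illustration of that reading, not a proposed legal test. The field names, the `UNCLASSIFIED` fallback, the order in which the checks run, and the interpretation of "a majority" as more than half are all my assumptions. It also assumes someone has already measured the inputs, which is the genuinely hard part.

```python
from dataclasses import dataclass
from enum import Enum

class Classification(Enum):
    SUMMARIZATION = "summarization"
    REPRODUCTION = "reproduction"
    COMPOSITE = "composite"
    PLAGIARISM = "plagiarism"
    UNCLASSIFIED = "unclassified"  # assumption: output fits none of the classes

@dataclass
class GeneratedWork:
    # All three fields are hypothetical measurements, assumed to be given.
    condenses_sources: bool   # output condenses training input into a shorter form
    creator_families: int     # creators whose key characteristics the work's value depends on
    credited_fraction: float  # share of the output credited to primary sources (0.0 to 1.0)

def classify(work: GeneratedWork) -> Classification:
    # Step 1: find the base class. Checking creators before summarization
    # is an assumption; the text does not rank the classes.
    if work.creator_families == 1:
        base = Classification.REPRODUCTION
    elif work.creator_families > 1:
        base = Classification.COMPOSITE
    elif work.condenses_sources:
        base = Classification.SUMMARIZATION
    else:
        return Classification.UNCLASSIFIED

    # Step 2: any base class becomes plagiarism when a majority of the
    # output is not credited (assumed here to mean credited_fraction <= 0.5).
    if work.credited_fraction <= 0.5:
        return Classification.PLAGIARISM
    return base

# Usage: a work leaning on two creators, with only 20% of it credited.
print(classify(GeneratedWork(False, 2, 0.2)).value)  # plagiarism
```

Note that plagiarism is not a fourth kind of output in this reading. It is any of the first three with the citations missing, which is why the credit check runs last.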
Such legal (and technical) classifications place the responsibility for content generation on the creators of the AI, rather than on the user, and allow existing copyright laws to be applied in ways that will help protect the intellectual property of humans.