This post was submitted on 13 Sep 2023
64 points (100.0% liked)

Technology


Avram Piltch is the editor in chief of Tom's Hardware, and he's written a thoroughly researched article breaking down the promises and failures of LLM AIs.

[–] [email protected] 30 points 9 months ago* (last edited 9 months ago) (3 children)

They have the right to ingest data not because they're "just learning like a human would", but because I - a human - have the right to grab all data that's available on the public internet and process it however I want, including by training statistical models. The only thing I don't have the right to do is distribute it (or works that resemble it too closely).

If you actually show me people who are extracting books from LLMs and reading them that way, then I'd agree that would be piracy - but that would be such a terrible reading experience, even if it worked, that I can't see it actually happening.

[–] [email protected] 28 points 9 months ago* (last edited 9 months ago) (2 children)

Two things:

  1. Many of these LLMs -- perhaps all of them -- have been trained on datasets that include books that were absolutely NOT released into the public domain.

  2. Ethically, we would ask any author who parrots the work of others to provide citations to the original references. That rarely happens with AI language models, and when they do provide citations, they often get them wrong.

[–] [email protected] 23 points 9 months ago (1 children)

I'm sick and tired of this "parrots the works of others" narrative. Here's a challenge for you: go to https://huggingface.co/chat/, input a prompt (for example, "Write a three-paragraph scene about Jason and Carol playing hide and seek with some other kids. Jason gets injured, and Carol has to help him."), and when you get the response, try to find the author it "parroted". You won't be able to - because it doesn't just reproduce someone else's already-written scene. It meshes many things from all over the training data in such a way that none of them will be even remotely recognizable.
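For anyone who'd rather run that challenge programmatically than through the web UI, here's a minimal sketch using the Hugging Face Inference API; the model name is an assumption, and any hosted instruction-tuned model would do:

```python
# Minimal sketch of the "find the parroted author" challenge via the
# Hugging Face Inference API. The model name is an assumption.
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mistral-7B-Instruct-v0.2")

prompt = (
    "Write a three-paragraph scene about Jason and Carol playing hide and "
    "seek with some other kids. Jason gets injured, and Carol has to help him."
)

# Generate the scene, then try to locate any source it "parroted".
scene = client.text_generation(prompt, max_new_tokens=500)
print(scene)
```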

[–] [email protected] 16 points 9 months ago (3 children)

And yet, we know that the work is mechanically derivative.

[–] [email protected] 15 points 9 months ago* (last edited 9 months ago) (1 children)

So is your comment. And mine. What do you think our brains do? Magic?

edit: This may sound inflammatory but I mean no offense

[–] [email protected] 3 points 9 months ago

No, I get it. I'm not really arguing that what separates humans from machines is "libertarian free will" or some such.

But we can properly argue that LLM output is derivative because we know it's derivative, because we designed it. As humans, we have the privilege of recognizing transformative human creativity in our laws as a separate entity from derivative algorithmic output.

[–] [email protected] 8 points 9 months ago* (last edited 9 months ago) (1 children)

So is literally every human work in the last 1000 years in every context.

Nothing is "original". It's all derivative. Feeding copyrighted work into an algorithm does not in any way violate any copyright law, and anyone telling you otherwise is a liar and a piece of shit. There is no valid interpretation anywhere close.

[–] [email protected] 5 points 9 months ago* (last edited 9 months ago) (1 children)

Every human work isn't mechanically derivative. The entire point of the article is that the way LLMs learn and create derivative text isn't equivalent to the way humans do the same thing.

[–] [email protected] 4 points 9 months ago (1 children)

It's complete and utter nonsense and they're bad people for writing it. The complexity of the AI does not matter and if it did, they're setting themselves up to lose again in the very near future when companies make shit arbitrarily complex to meet their unhinged fake definitions.

But none of it matters because literally no part of this in any way violates copyright law. Processing data is not and does not in any way resemble copyright infringement.

[–] [email protected] 3 points 9 months ago (1 children)

This issue is easily resolved. Create the AI that produces useful output without using copyrighted works, and we don't have a problem.

If you take the copyrighted work out of the input training set, and the algorithm can no longer produce the output, then I'm confident saying that the output was derived from the inputs.

[–] [email protected] 2 points 9 months ago (1 children)

There is literally not one single piece of art that is not derived from prior art in the past thousand years. There is no theoretical possibility for any human exposed to human culture to make a work that is not derived from prior work. It can't be done.

Derivative work is not copyright infringement. Straight up copying someone else's work directly and distributing that is.

[–] [email protected] 3 points 9 months ago (1 children)

There is literally not one single piece of art that is not derived from prior art in the past thousand years.

This is false. Somebody who looks at a landscape, for example, and renders that scene in visual media is not deriving anything important from prior art. Taking a video of a cat is an original creation. This kind of creation happens every day.

Their output may seem similar to prior art, and perhaps their methods were developed previously, but the inputs are original and clean. They're not using existing art as their sole inputs.

AI uses existing art as its sole input. This is a crucial distinction. I would have no problem at all with an AI that worked exclusively from original inputs and verified public-domain (or copyright-not-enforced) material, although I don't know if I'd consider the outputs themselves to be copyrightable (as that is a right attached to a human author).

Straight up copying someone else’s work directly

And that's what the training set is. Verbatim copies, often including copyrighted works.

That's ultimately the question that we're faced with. If there is no useful output without the copyrighted inputs, how can the output be non-infringing? Copyright defines transformative work as the product of human creativity, so we have to make some decisions about AI.

[–] [email protected] 1 points 9 months ago (1 children)

If they've seen prior art, yes, they are. It's literally not possible to be exposed to the history of art and not have everything you output be derivative in some manner.

Processing and learning from copyrighted material is not restricted by current copyright law in any way. It cannot be infringement, and shouldn't be able to be infringement.

[–] [email protected] 3 points 9 months ago (1 children)

It’s literally not possible to be exposed to the history of art and not have everything you output be derivative in some manner.

I respectfully disagree. You may learn methods from prior art, but there are plenty of ways to ensure that content is generated only from new information. If you mean to argue that a rendering of a landscape that a human is actually looking at is meaningfully derivative of someone else's art, then I think you need to make a more compelling argument than "it just is".

[–] [email protected] 1 points 9 months ago

Seeing how other pictures are framed is exactly identical to seeing how other stories are written.

[–] [email protected] 4 points 9 months ago* (last edited 9 months ago) (1 children)

From Wikipedia, "a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work".

You can probably call the output of an LLM 'derived', in the same way that if I counted the number of 'Q's in Harry Potter, the result would be derived from Rowling's work.

But it's not 'derivative'.

Technically it's possible for an LLM to output a derivative work if you prompt it to do so. But most of its outputs aren't.
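To make that 'derived but not derivative' distinction concrete, here's a trivial sketch in the spirit of the letter-counting example; the file path is hypothetical:

```python
# Derived but not derivative: a statistic computed from a copyrighted text
# contains none of its copyrightable expression. The file path is hypothetical.
def count_letter(path: str, letter: str = "q") -> int:
    with open(path, encoding="utf-8") as f:
        return f.read().lower().count(letter.lower())

# Prints a single integer - derived from the book, yet infringing nothing.
print(count_letter("harry_potter.txt"))
```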

[–] [email protected] 4 points 9 months ago (1 children)

a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work

What was fed into the algorithm? A human decided which major copyrighted elements of previously created original work would seed the algorithm. That's how we know it's derivative.

If I take somebody's copyrighted artwork and apply Photoshop filters that change the color of every single pixel, have I made an expressive creation that does not include copyrightable elements of a previously created original work? The courts have said "no", and I think the burden is on AI proponents to show how they fed copyrighted work into a mechanical algorithm and produced a new expressive creation free of copyrightable elements.

[–] [email protected] 4 points 9 months ago* (last edited 9 months ago)

I think the test for "free of copyrightable elements" is pretty simple - can you look at the new creation and recognize any copyrightable elements in it? The process by which it was created doesn't matter. Maybe I made this post entirely by copy-pasting phrases from other people, who knows (well, I didn't, only because it would be too much work), but it does not infringe either way...

[–] [email protected] 4 points 9 months ago (2 children)

Is there a meaningful difference between reproducing the work and giving a summary? Because I'll absolutely be using AI to filter all the editorial garbage out of the news - set up and trained by me to surface what is meaningful to me, stripped of all advertising, sponsorships, and detectable bias.

[–] [email protected] 10 points 9 months ago (1 children)

When you figure out how to train an AI without bias, let us know.

[–] [email protected] 7 points 9 months ago (3 children)

You're confusing AI with ChatGPT. But to answer your question: if it's my own bias, why would I care that it's in my personal AI? That's kind of the point: using my personal lens (bias) to determine what info I'd be interested in being alerted to.

[–] [email protected] 6 points 9 months ago

oooh I dunno man having an AI feed you shit based on what fits your personal biases is basically what social media already does and I do not think that's something we need more of.

[–] [email protected] 5 points 9 months ago

You're confusing AI with ChatGPT

?????????

[–] [email protected] 5 points 9 months ago* (last edited 9 months ago)

I have yet to find an LLM that can summarize a text without errors. I already mentioned this in another post a few days back, but Google's new search preview is driving me mad with all the hidden factual errors. They make me click only to realize that the LLM told me what I wanted to find, not what is there (wrong names, wrong dates, etc.).

I greatly prefer the old excerpt summaries over the new imaginary ones (they're currently A/B testing).

[–] [email protected] 20 points 9 months ago* (last edited 9 months ago) (1 children)

You're making two, big incorrect assumptions:

  1. Simply seeing something on the internet does not give you any legal or moral right to use that thing in any way other than ways which are, or have previously been, deemed "fair use" by a court of law. Individuals have personal rights over their likeness and persona, and copyright holders have rights over their works, whether they are on the internet or not. In other words, there is a big difference between "visible in public" and "public domain".
  2. More importantly, something that might be considered "fair use" for a human being to do is not necessarily "fair use" when a computer or "AI" does it. Judgements of what is and is not fair use are made on a case-by-case basis as a legal defense against copyright infringement claims, and multiple factors (purpose of use, nature of the original work, amount and substantiality of use, market effect, etc.) are taken into consideration. At the very least, AI use has serious implications for the substantiality and market-effect factors, especially compared to examples of human use.

I know these are really tough pills for AI fans to swallow, but you know what they say... "If it seems too good to be true, it probably is."

[–] [email protected] 10 points 9 months ago* (last edited 9 months ago)

On the contrary - the reason copyright is called that is because it started as the right to make copies. Since then it's been expanded to cover more than just copies, such as distributing derivative works.

But the act of distribution is key. If I wanted to, I could write whatever derivative works in my personal diary.

I also have the right to count the number of occurrences of the letter 'Q' in Harry Potter without Rowling's permission. Thus I can also post my count online for other lovers of 'Q', because it's not derivative (it is 'derived', but 'derivative' is different - according to Wikipedia, it must 'include major copyrightable elements' of the original).

Or do more complex statistical analysis.

[–] [email protected] 14 points 9 months ago* (last edited 9 months ago)

I think this opinion is going to be looked at a lot like the anti-privacy arguments when Facebook and Google were first revealed to be massively invading people's privacy.

We look at those platforms with disdain now, but at the time all you ever heard people saying, over and over, was "If you have nothing to hide, who cares about privacy?", "Anything you put on the Internet is fair game," and "Your privacy is already gone; there's nothing we can or should do now."

And then that careless attitude led to things that those people hadn't foreseen, like the Cambridge Analytica scandal, massive troll farm campaigns and Trump's election.

Looking back we're going to see this argument about data scraping to fuel LLMs in the same way.

[–] [email protected] 23 points 9 months ago (5 children)

I like the point about LLMs interpolating data while humans extrapolate. I think that sums up a key difference in "learning". It's also an interesting point that we anthropomorphise ML models by using words such as learning or training, but I wonder if there are better words to use. Fitting?

[–] [email protected] 14 points 9 months ago

"Plagiarizing" 😜

[–] [email protected] 10 points 9 months ago (2 children)

Aren't interpolation and extrapolation effectively the same thing, given a complex enough system?

[–] [email protected] 2 points 9 months ago (1 children)

No - repeated extrapolation eventually results in making everything that ever could be made, while constant interpolation would result in creating the same "average" work over and over.

The difference is infinite vs zero variety.
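A toy numpy sketch of that difference: any weighted average of samples (interpolation) is trapped inside the range of the data, while extending a trend beyond the data (extrapolation) can reach values no sample contains.

```python
# Toy illustration: interpolation stays inside the data's range,
# extrapolation escapes it. Plain numpy; no ML required.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(low=0.0, high=1.0, size=1000)  # the "training data"

# Interpolation: any convex combination of samples lands back in [0, 1].
weights = rng.dirichlet(np.ones(len(samples)))
interpolated = weights @ samples
assert 0.0 <= interpolated <= 1.0

# Extrapolation: extending the data's trend leaves [0, 1] entirely.
lo, hi = samples.min(), samples.max()
extrapolated = hi + 10 * (hi - lo)  # keep stepping and you can reach anything

print(interpolated, extrapolated)
```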

[–] [email protected] 2 points 9 months ago

Depending on the geometry of the state space, very literally yes. Think about a sphere: there's a straight line passing from Denver to Guadalajara, roughly hitting Delhi on the way. Is Delhi in between them (interpolation), or behind one from the other (extrapolation)? Kind of both - unless you move the goalposts and add distance limits to interpolation, which could themselves be broken by another geometry.

[–] [email protected] 8 points 9 months ago

What about tuning, to align with "finetuning"?

[–] [email protected] 6 points 9 months ago

I also like the point about interpolation vs extrapolation. It's demonstrated when you look at art history (or the history of any other creative field). Humans don't look at paintings and create something that's predictable based on those paintings. They go "what happens when I take that idea and go even further?" An LLM could never have invented Cubism after looking at Paul Cezanne's paintings, but Pablo Picasso did.

[–] [email protected] 2 points 9 months ago

That's not a limitation of ML, just of how it is commonly used. You can take every parameter that the neural network recognizes, tweak it, make it bigger or smaller, recombine it with other stuff, and marvel at the results. That's how we got origami porn, (de)cartoonify AI, QR code art, Balenciaga, dancing statues, and my 5-minute attempt at reinventing Cubism (tell the AI to draw cubes over a depth map).
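That cubes-over-a-depth-map trick maps onto off-the-shelf tooling; here's a rough sketch with the diffusers library, where the model choice, prompt, and file names are assumptions rather than what the commenter actually used:

```python
# Rough sketch of "draw cubes over a depth map" with a depth-conditioned
# diffusion pipeline. Model, prompt, and file names are assumptions.
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from PIL import Image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.png")  # hypothetical input photo
result = pipe(
    prompt="a cubist painting built from geometric cubes",
    image=init_image,  # depth is estimated from this image internally
    strength=0.8,      # how far the output may depart from the original
).images[0]
result.save("cubist.png")
```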

[–] [email protected] 15 points 9 months ago (2 children)

Let's be clear on where the responsibility belongs, here. LLMs are neither alive nor sapient. They themselves have no more "rights" than a toaster. The question is whether the humans training the AIs have the right to feed them such-and-such data.

The real problem is the way these systems are being anthropomorphized. Keep your attention firmly on the man behind the curtain.

[–] [email protected] 7 points 9 months ago

Yes, these are the same people who are charging a fee to use their AI and profiting. Placing the blame and discussion on the AI itself conveniently overlooks a lot here.

[–] [email protected] 2 points 9 months ago* (last edited 9 months ago) (1 children)

You know, I think ChatGPT is way ahead of a toaster. Maybe it's more like a small animal of some kind.

[–] [email protected] 2 points 9 months ago (1 children)

One could equally claim that the toaster was ahead, because it does something useful in the physical world. Hmm. Is a robot dog more alive than a Tamagotchi?

[–] [email protected] 1 points 9 months ago* (last edited 9 months ago) (1 children)

There are a lot of subjects where ChatGPT knows more than I do.

Does it know more than someone who has studied that subject their whole life? Of course not. But those people aren't available to talk to me on a whim. ChatGPT is available, and it's really useful. Far more useful than a toaster.

As long as you only use it for things where a mistake won't be a problem, it's a great tool. And you can also use it for "risky" decisions, but take the information it gives you to an expert for verification before acting.

[–] [email protected] 3 points 9 months ago

Sorry to break it to you, but it doesn't "know" anything except what text is most likely to come after the text you just typed. It's an autocomplete. A very sophisticated one, granted, but it has no notion of "fact" and no real understanding of the content of what it's saying.

Saying that it knows what it's spouting back to you is exactly what I warned against up above: anthropomorphization. People did this with ELIZA too, and it's even more dangerous now than it was then.
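The "sophisticated autocomplete" framing is easy to demonstrate at toy scale: the bigram model below does nothing but pick a statistically likely next word. LLMs are vastly more sophisticated, but the training objective is the same kind of next-token prediction.

```python
# A toy bigram "autocomplete": it picks the next word purely by frequency,
# with no notion of fact or meaning - only of what tends to come next.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Count which word follows which.
following = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    following[w1][w2] += 1

# Generate by repeatedly sampling a likely next word.
word, output = "the", ["the"]
for _ in range(8):
    choices = following.get(word)
    if not choices:
        break
    word = random.choices(list(choices), weights=choices.values())[0]
    output.append(word)

print(" ".join(output))
```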

[–] [email protected] 12 points 9 months ago (1 children)

Machines don't learn like humans yet.

Our brains are a giant electrical/chemical system that somehow creates consciousness. We might be able to create that in a computer. And the day it happens, then what will be the difference between a human and a true AI?

[–] [email protected] 3 points 9 months ago (1 children)

If you read the article, there are "experts" saying that human comprehension is fundamentally computationally intractable, which is basically a religious standpoint. Like, ChatGPT isn't intelligent yet, partly because it doesn't really have long-term memory, but yeah, there's overwhelming evidence the brain is a machine like any other.

[–] [email protected] 2 points 9 months ago (3 children)

fundamentally computationally intractable

...using current AI architectures - and the insight isn't new, it's maths. This is currently the best idea we have about the subject. Trigger warning: cybernetics, and lots of it.

Meanwhile, yes, of course brains are machines like any other; claiming otherwise is claiming you can compute uncomputable functions, which is a physical and logical impossibility. And it's fucking annoying to talk about this topic with people who don't understand computability. It usually turns into a shouting match of "you're claiming the existence of something like a soul, some metaphysical origin of the human mind" vs. "no I'm not" vs. "yes you are, but you don't understand why".

[–] [email protected] 11 points 9 months ago (1 children)

There's a lot of opinion in here written in as if it's fact.

[–] [email protected] 3 points 9 months ago* (last edited 9 months ago) (1 children)

In the long run it doesn't really matter whether the LLM is trained on all the information out there, as the LLM will be able to search the Web on demand and report back with what it finds. Bing Chat essentially already does that, and we have a few summarizer bots doing similar jobs. The need to access websites directly and wade through all the clickbait and ads in the hope of finding the bit of information you are actually interested in will be over.

The LLM will be Adblock, Reader Mode, SQL, and a lot more rolled into one - a Swiss Army knife for accessing and transforming information. Not sure where that leaves the journalists, but cheap clickbait might lose a lot of value.
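A sketch of that "reader mode" idea with today's tooling - the model name and URL are placeholders, and a real pipeline would need sturdier HTML extraction:

```python
# Minimal sketch of an LLM "reader mode": fetch a page, strip it to text,
# ask a model for just the substance. Model name and URL are placeholders.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

html = requests.get("https://example.com/some-article", timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:8000]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize the article. Drop ads, "
         "sponsorships and editorializing; keep names, dates and numbers."},
        {"role": "user", "content": text},
    ],
)
print(reply.choices[0].message.content)
```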

[–] [email protected] 1 points 9 months ago

Bruh, the LLM will be trained on 90% click bait. It's gonna be just as trashy.

[–] [email protected] 3 points 9 months ago

🤖 I'm a bot that provides automatic summaries for articles:

Unfortunately, many people believe that AI bots should be allowed to grab, ingest and repurpose any data that’s available on the public Internet whether they own it or not, because they are “just learning like a human would.” Once a person reads an article, they can use the ideas they just absorbed in their speech or even their drawings for free.

Iris van Rooij, a professor of computational cognitive science at Radboud University Nijmegen in the Netherlands, posits that it’s impossible to build a machine that reproduces human-style thinking by using even larger and more complex LLMs than we have today.

NY Times Tech Columnist Farhad Manjoo made this point in a recent op-ed, positing that writers should not be compensated when their work is used for machine learning because the bots are merely drawing “inspiration” from the words like a person does.

“When a machine is trained to understand language and culture by poring over a lot of stuff online, it is acting, philosophically at least, just like a human being who draws inspiration from existing works,” Manjoo wrote.

In his testimony before a U.S. Senate subcommittee hearing this past July, Emory Law Professor Matthew Sag used the metaphor of a student learning to explain why he believes training on copyrighted material is usually fair use.

In fact, Microsoft, which is a major investor in OpenAI and uses GPT-4 for its Bing Chat tools, released a paper in March claiming that GPT-4 has “sparks of Artificial General Intelligence” – the endpoint where the machine is able to learn any human task thanks to it having “emergent” abilities that weren’t in the original model.


Saved 93% of original text.

[–] [email protected] 2 points 9 months ago

That's a philosophical debate we can't really answer, not a lie. The question is whether we do anything other than copy. The biggest elephant in the room, without any doubt, is that AIs don't yet remember and iterate like we do, but that's probably just a matter of time. Beyond that, the very different environment we learn in is another huge issue for any comparison. It's a tricky question we might never know the answer to, but it's also fascinating to think about, and I don't think rejecting the idea altogether is an especially good answer.
