Pattern Matching is Fun

Playing with ChatGPT earlier, I was fascinated by how obvious the patterns were in most of the outputs (read through to the end for a demo).

It was mainly down to certain structures and similarities, and I wondered how cleaning it all up, and converting it to a version that wasn’t so obviously AI, would work out in practice.

First off, it’s difficult!

Reading stuff as a human is as easy as shelling peas. We see things and say aha! Look at that. But getting a script to do the same with specifics is a different ball game, especially from the perspective of language rules and syntax.

In going from AI to a more humanised version, there are, of course, a few things to consider when evaluating the text.

How does it read? Is it stilted or grammatically nonsensical? Is it repetitious, with some words used so many times that it becomes ridiculous? Are there common gotchas and things like that?

It’s all in the text…

So, one of the things I had noticed when I asked ChatGPT to write articles was a tendency to create numbered lists of things. In the sample text below, you’ll find that each of the tips is constructed in the form Statement: Method and Reason.

Whether this is or isn’t a problem depends on the type of work being performed, but most humans wouldn’t use that format all of the time, and if they did, their articles would soon begin to look a little samey. So a means of identifying that and doing something with it could be useful.

So far, I’ve got as far as catching the “:” instance, grabbing the words that precede it, creating a subheading, using the remaining text to form a paragraph, and spitting the output into an HTML file for use later on.
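As a rough illustration of that step, here is a minimal JavaScript sketch. The function names (`escapeHtml`, `tipsToHtml`) and the sample text are made up for the example, and the choice to strip a leading list number and to keep colon-less lines as plain paragraphs is an assumption of this sketch rather than a description of the demo. Only the colon pattern comes from the post.

```javascript
// Pattern from the post: lazily match everything up to the first colon
// that is followed by whitespace. No g/m flags, since we go line by line.
const LABEL = /^.+?:\s/;

// Hypothetical helper: escape the characters that matter inside HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: turn "Statement: Method and Reason" lines into
// a subheading followed by a paragraph.
function tipsToHtml(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const match = line.match(LABEL);
      if (!match) {
        // Assumption: lines without a colon separator stay as paragraphs.
        return `<p>${escapeHtml(line)}</p>`;
      }
      // Drop a leading list number ("1. " or "1) ") and the trailing colon.
      const heading = match[0].replace(/^\d+[.)]\s*/, "").replace(/:\s$/, "");
      const body = line.slice(match[0].length);
      return `<h2>${escapeHtml(heading)}</h2>\n<p>${escapeHtml(body)}</p>`;
    })
    .join("\n");
}

const sample =
  "1. Plan ahead: By planning, you save time.\n" +
  "2. Edit twice: By editing, you catch errors.";
console.log(tipsToHtml(sample));
```

Writing the returned string to a file (or dropping it into a page) is left out to keep the sketch short.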

The reasoning here is that having a list of reasons set up differently, with subheadings and paragraphs, might break any expected pattern matching and humanise it slightly.

Word count, lists, footprints, yada yada

Of course, it isn’t entirely about structure; it’s about the language and words used too. I’d read about the footprinting of AI output, and it can be seen if you look closely enough.

  • Opening paragraphs that I produced often introduced the topic by outlining a hypothesis and then followed up with some reasoning around how. And well… that’s what we humans tend to do too, right?
  • The reasoning part of these opening paragraphs, in the second sentence, often started with the word By. “We shall do amazing things blah blah. By doing this, we shall ensure that x will equal y…”
  • The list of tips or steps provided were all pretty similar in terms of length and word count, at least within a reasonably close number.
  • The steps or tips were usually followed by a concluding paragraph that summed up the article with a few pithy words, that for me often lacked emotion or soul.
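The "similar length" footprint is the easiest of these to put a number on. A minimal sketch, assuming the tips have already been split into an array of strings: count the words in each and look at how far the counts spread relative to their average (the coefficient of variation). The function names are hypothetical, and what counts as "suspiciously even" is left for a human to decide.

```javascript
// Count whitespace-separated words in a string.
function wordCount(text) {
  const words = text.trim().match(/\S+/g);
  return words ? words.length : 0;
}

// Hypothetical helper: summarise how uniform a list of items is by length.
// cv is the population standard deviation divided by the mean, so a value
// near 0 means every item is roughly the same length.
function lengthSpread(items) {
  if (items.length === 0) {
    return { counts: [], mean: 0, cv: 0 };
  }
  const counts = items.map(wordCount);
  const mean = counts.reduce((sum, n) => sum + n, 0) / counts.length;
  const variance =
    counts.reduce((sum, n) => sum + (n - mean) ** 2, 0) / counts.length;
  const cv = mean === 0 ? 0 : Math.sqrt(variance) / mean;
  return { counts, mean, cv };
}

console.log(lengthSpread(["one two three", "four five six"]));
console.log(lengthSpread(["short one", "a noticeably longer tip here"]));
```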

Things we could look at

So in terms of the readability stuff we could do the whole Flesch-Kincaid Grade Level formula or the Gunning Fog Index. These formulas typically use measures such as average sentence length and the number of syllables per word to determine the readability of the text.
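For reference, the Flesch-Kincaid Grade Level works out as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. Below is a small JavaScript sketch of it. The syllable counter is a crude vowel-group heuristic chosen for this example (proper tools use pronunciation dictionaries), and the sentence splitting is equally naive, so treat the scores as ballpark figures only.

```javascript
// Rough heuristic (an assumption of this sketch): one syllable per run of
// vowels, minus a trailing silent "e", never fewer than one.
function countSyllables(word) {
  const cleaned = word.toLowerCase().replace(/[^a-z]/g, "");
  if (cleaned.length === 0) return 0;
  const groups = cleaned.match(/[aeiouy]+/g);
  let count = groups ? groups.length : 0;
  // Treat a final "e" as silent, except in "-le" endings such as "table".
  if (count > 1 && cleaned.endsWith("e") && !cleaned.endsWith("le")) {
    count -= 1;
  }
  return Math.max(count, 1);
}

// Flesch-Kincaid Grade Level with naive sentence and word splitting.
function fleschKincaidGrade(text) {
  const sentences = text.split(/[.!?]+/).filter((s) => s.trim().length > 0);
  const words = text.match(/[A-Za-z']+/g) || [];
  if (sentences.length === 0 || words.length === 0) return 0;
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  return (
    0.39 * (words.length / sentences.length) +
    11.8 * (syllables / words.length) -
    15.59
  );
}

console.log(fleschKincaidGrade("The cat sat. The dog ran.").toFixed(2)); // -2.62
```

Very short, simple sentences can score below zero, as the example shows; the formula was designed for longer passages.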

The issue here is that if the AI is aware of them, then the better ones will bake those factors in, so detecting them isn’t immediately useful. That said, not all humans write in the same way, so if a set of documents consistently landed around the same score, their footprint would be more easily identifiable when they’re considered together.

The use of the word “By” in the method part of the steps, for instance, could be flagged and identified as potentially problematic. We could use a little script to do a word count thing and check the text against a list of human-defined uh-ohs.
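A little script along those lines might look like the sketch below. The watch list and the function name are hypothetical, and the splitting is deliberately simple: a new "segment" starts after a full stop, exclamation mark, question mark or colon, so that the method part after "Statement:" counts as its own opener. Only the first word of each segment is checked, so a "by" in the middle of a sentence is ignored.

```javascript
// Human-defined list of openers to flag (lower case). Extend to taste.
const WATCH_LIST = ["by"];

// Hypothetical helper: tally how often a segment starts with a watched word.
function flagOpeners(text, watchList = WATCH_LIST) {
  // A segment runs up to and including the next ., !, ? or :
  const segments = text.match(/[^.!?:]+[.!?:]*/g) || [];
  const tally = {};
  for (const segment of segments) {
    const firstWord = (segment.match(/[A-Za-z']+/) || [""])[0].toLowerCase();
    if (watchList.includes(firstWord)) {
      tally[firstWord] = (tally[firstWord] || 0) + 1;
    }
  }
  return tally;
}

const text =
  "Plan ahead: By planning, you save time. " +
  "We shall do amazing things. By doing this, we win. Stand by me.";
console.log(flagOpeners(text)); // { by: 2 }
```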

We could, as a result, ensure that our steps were formatted differently or weren’t too close length-wise – humanised, tick!

We could even introduce typos and purposeful grandma mistakes 😉 and spellinks. AI doesn’t make those mistakes after all – humanised, tick!

Most importantly mind, we could make sure that we always edited what we were intending to publish and made sure it was fit for purpose – humanised, tick!

It is a little bonkers that people will be looking to humanise AI generated text, but humanise they must, else they’ll likely find themselves in the bad graces of Google and we know how that ends up.

In the short term, perhaps, writers will be worth their weight in peanuts as their labour value plummets. One envisions a dystopian production line of writers tasked with creating hundreds of 1,000-word articles per day. The Fiverr brigade are no doubt wetting their pants with equal measures of glee and despair too, as for a time, it’s open season.

The flip side of this is that good editors should become a more precious resource, as the smarter publishers realise that a human touch matters more now than ever.

Anyways, click the blue text options that appear below for a demo.

It will take a sample block of text and look for a certain pattern using a regex, /^.+?:\s/gm, which in this case keys on the colon (:). All the words before the colon are converted to headings and all the words after it form paragraphs.
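To see what that pattern actually picks up, here is a short sketch with made-up sample text. The `m` flag lets `^` match at the start of every line, the `g` flag finds every match rather than just the first, and the lazy `.+?` stops at the first colon that is followed by whitespace.

```javascript
const pattern = /^.+?:\s/gm;

const text = [
  "Plan ahead: By planning, you save time.",
  "No separator on this line",
  "Note: Ratio: 2:1 works",
].join("\n");

// Collect the matched text for each line that has a "label: " prefix.
const labels = [...text.matchAll(pattern)].map((m) => m[0]);
console.log(labels); // [ 'Plan ahead: ', 'Note: ' ]
```

Two things worth knowing: a colon with no whitespace after it (the "2:1") is ignored, and because `\s` also matches a newline, a line that ends in a colon will match right through to the line break.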

It will also count the number of words in each paragraph, so you can see how the lengths compare. It will also show a comparison with what I’ve written, held in a div with an id of “robswords”, which essentially contains the words of my blog post.
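The counting and comparison can be sketched without a browser. In the example below the two texts are passed in as plain strings; on a page you would read them from the relevant elements instead. The function names and the summary fields are assumptions made for the example, and paragraphs are taken to be blocks separated by a blank line.

```javascript
// Word count for each paragraph (paragraphs are separated by blank lines).
function paragraphWordCounts(text) {
  return text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((p) => p.split(/\s+/).length);
}

// Hypothetical helper: summarise two texts side by side.
function compareTexts(sampleText, ownText) {
  const summarise = (counts) => ({
    paragraphs: counts.length,
    words: counts.reduce((a, b) => a + b, 0),
    shortest: counts.length ? Math.min(...counts) : 0,
    longest: counts.length ? Math.max(...counts) : 0,
  });
  return {
    sample: summarise(paragraphWordCounts(sampleText)),
    own: summarise(paragraphWordCounts(ownText)),
  };
}

const generated = "Tip one is short.\n\nTip two is short.";
const mine = "First off, it is difficult!\n\nBut getting a script to do the same is a different ball game.";
console.log(compareTexts(generated, mine));
```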

Finally, it will produce an exportable table of words, along with the HTML of the output text if needed.
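One way to build such a table, assuming it is a word-frequency table exported as CSV (the format here is a guess, and the function names are hypothetical):

```javascript
// Build [word, count] rows, most frequent first, ties in alphabetical order.
function wordTable(text) {
  const counts = new Map();
  const words = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
  );
}

// Turn the rows into CSV text. Words only contain letters and apostrophes,
// so no quoting or escaping is needed.
function toCsv(rows) {
  return ["word,count", ...rows.map(([word, count]) => `${word},${count}`)].join("\n");
}

console.log(toCsv(wordTable("By the way, by the book")));
```

Sorting by frequency puts overused words at the top, which is usually what you want to eyeball first.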

It will work for different blocks of text, but won’t create HTML output where a colon separator is absent. The sample text has colons to demonstrate; you could of course add other separators and more sophistication, but maybe that’s for someone smarter than me.


Published by Rob Watts

I've worked in search for over 25 years with businesses of all shapes and sizes.