I Tested AI Detectors on My Daily Feeds — Here’s What I Found

Published: November 09, 2025

Introduction

With AI-generated content becoming increasingly common (even this blog post has been processed by AI), I’ve started wondering how much of what I read every day is actually produced by AI, and whether I still have the ability or the tools to tell AI-generated content apart from human-written content.

To explore this, I ran a small experiment. I chose three mainstream content platforms: X (formerly Twitter), LinkedIn, and Medium. On each of five days, I randomly selected three posts I encountered on each platform (45 posts in total, 15 per platform) and rated how likely I thought each one was AI-generated.

In addition, I ran the same content through five popular AI-text detectors: GPTZero, Originality.ai, Quillbot, GPT-2 Output Detector, and GLTR. I don’t have the ground truth (I didn’t ask the authors whether they used AI), but it was still fascinating to see how much of my daily input might be influenced by AI. It also made me wonder whether different platforms have different proportions of AI-generated content and, if so, whether we should start seeking out “high human-to-AI ratio” platforms when authenticity really matters.

Before diving into the results, I’ll give a brief overview of how the two publicly documented detectors I used (GPT-2 Output Detector and GLTR) work, and of how I assigned my own scores.

GPT-2 Output Detector

The GPT-2 Output Detector is a RoBERTa-based sequence classifier. It is trained to distinguish human-written text from text generated by GPT-2, and it outputs the probability that a given input was written by a human rather than generated by the GPT-2 language model.

For further details, you can refer to the following resources: OpenAI’s GPT-2 1.5B release, the research paper, and the GPT-2 Output Dataset repository.

GLTR

GLTR (Giant Language Model Test Room) uses a language model such as GPT-2 to detect AI-generated text. It is based on the idea that language generation models tend to pick statistically likely words, while human-written text shows more variation. For each word, GLTR checks where that word ranks among the model’s predictions given the preceding text and color-codes it: green if it is among the top 10 predictions, yellow for the top 100, red for the top 1,000, and purple otherwise. A paragraph made up mostly of high-probability (green) words is therefore more likely to be AI-generated.

For more information on GLTR, visit GLTR.io or check out the research paper.

In the following analysis, I used the ratio of “green” words to total words in GLTR as its estimate of AI-generation probability. For the other AI detectors, I simply used the scores or probabilities they provided to represent how likely the text was generated by AI.
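The green-word ratio can be sketched in a few lines. This is a minimal illustration, not GLTR's own code: it assumes the per-word ranks have already been computed by a language model, and that "green" means a rank within the top 10 (GLTR's default color scheme). The function name `green_fraction` and the example ranks are hypothetical.

```python
from typing import Sequence


def green_fraction(ranks: Sequence[int], top_k: int = 10) -> float:
    """Share of words whose rank under the language model is within top_k.

    ranks[i] is the 1-based position of the i-th word in the model's sorted
    predictions (1 = the word the model considered most likely).
    """
    if not ranks:
        raise ValueError("need at least one word rank")
    green = sum(1 for r in ranks if 1 <= r <= top_k)
    return green / len(ranks)


# Hypothetical ranks for an 8-word passage, for illustration only.
example_ranks = [1, 3, 2, 57, 1, 8, 1204, 4]
print(green_fraction(example_ranks))  # 6 of the 8 words are green -> 0.75
```

Because fluent human writing also contains many predictable words, this ratio tends to sit well above zero for any text, which is worth keeping in mind when comparing it with the probabilities reported by other detectors.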

Human Evaluation

All the human scores in this analysis are my own ratings. I used a 5-point scale in the style of a Mean Opinion Score (MOS): if I believed a piece was definitely written by AI, I gave it a 5; if I thought it was definitely human-written, I gave it a 1. These scores were later normalized to a 0–1 range for analysis.
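The post does not state the normalization formula, so the sketch below shows two common options as assumptions: min-max scaling, which maps 1 to 0 and 5 to 1, and division by the maximum, which maps 1 to 0.2 and 5 to 1. Both function names are hypothetical.

```python
def minmax_normalize(score: int, low: int = 1, high: int = 5) -> float:
    """Map a rating in [low, high] onto [0, 1] (1 -> 0.0, 5 -> 1.0)."""
    if not low <= score <= high:
        raise ValueError(f"score must be between {low} and {high}")
    return (score - low) / (high - low)


def divide_by_max(score: int, high: int = 5) -> float:
    """Map a rating in [1, high] onto [1/high, 1] (1 -> 0.2, 5 -> 1.0)."""
    if not 1 <= score <= high:
        raise ValueError(f"score must be between 1 and {high}")
    return score / high


ratings = [1, 3, 5]
print([minmax_normalize(s) for s in ratings])  # [0.0, 0.5, 1.0]
print([divide_by_max(s) for s in ratings])     # [0.2, 0.6, 1.0]
```

The two variants differ only by a linear rescaling, so correlations with detector scores are the same under either one; only the reported means and standard deviations shift. If each platform's human mean is an average of 15 whole-number ratings, the reported means (0.48, 0.52, 0.413333) are multiples of 1/75, which would fit the second variant. That is an inference from the numbers, not something the post states.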

(One thing worth noting is that I’m not a native English speaker, which might slightly influence my judgment when evaluating whether something “feels” AI-generated.)

Now, let’s move on to the results. First, we’ll look at the percentage of AI-generated content on each platform — as evaluated by each detector and by myself.

Experimental Results

Across all three platforms, Quillbot and the GPT-2 Output Detector (especially the latter) barely discriminate between posts. Their outputs are highly concentrated, with AI probabilities consistently close to zero. This isn’t too surprising for the GPT-2 Output Detector: as of November 2025, very little AI-generated text is actually produced by GPT-2. It is more surprising that Quillbot, a commercial product, behaves in a similar way.

On the other hand, GPTZero displays a noticeably wider score distribution. While that doesn’t necessarily mean it’s more accurate (since I don’t have the ground truth), it does suggest that GPTZero attempts a more fine-grained evaluation of how “AI-like” a text might be compared to the other detectors.

Next, let’s look at some statistics and put the results of all three platforms together.

Detector	X (Twitter)	LinkedIn	Medium
GPTZero	0.218667 / 0.403111 / 0.380635	0.3 / 0.425357 / 0.511976	0.533333 / 0.41579 / 0.904627
Originality.ai	0.066667 / 0.255445 / 0.149693	0.192 / 0.391174 / -0.461454	0.294667 / 0.432119 / 0.697721
Quillbot	0.0 / 0.0 / 0.0	0.088 / 0.256549 / 0.299827	0.064 / 0.186616 / 0.206284
GPT-2 Output Detector	0.000233 / 0.000062 / 0.968864	0.03834 / 0.147356 / 0.250164	0.00318 / 0.011459 / 0.203963
GLTR	0.674333 / 0.066996 / -0.229616	0.641067 / 0.066023 / -0.437913	0.6584 / 0.033645 / 0.341786
Human (me)	0.48 / 0.248424 / 1.0	0.52 / 0.211119 / 1.0	0.413333 / 0.24456 / 1.0

Each cell in the table above shows three values separated by slashes: the mean score, the standard deviation, and the correlation with my human ratings, for one detector on one platform.
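For readers who want to reproduce this kind of summary, here is a minimal sketch of how one cell could be computed. The post does not say which correlation coefficient or which standard deviation was used, so Pearson correlation and the sample standard deviation are assumptions here, and the scores in the example are made up rather than taken from the experiment.

```python
import math
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan  # undefined: one series never varies
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def summarize(detector_scores, human_scores):
    """Return (mean, sample std, correlation with human scores) for one cell."""
    return (
        mean(detector_scores),
        stdev(detector_scores),
        pearson(detector_scores, human_scores),
    )


# Made-up scores for five posts, for illustration only.
detector = [0.1, 0.4, 0.35, 0.8, 0.9]
human = [0.2, 0.4, 0.4, 0.8, 1.0]
m, s, r = summarize(detector, human)
print(f"{m:.3f} / {s:.3f} / {r:.3f}")
```

One design choice worth noting: a detector that returns the same score for every post (a cell with a standard deviation of 0.0) has no defined Pearson correlation, so this sketch returns NaN in that case rather than 0.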

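From these results, GPTZero shows the most consistent positive correlation with my ratings: 0.38 on X, 0.51 on LinkedIn, and 0.90 on Medium. Interestingly, even though the GPT-2 Output Detector almost always reports very low AI probabilities, its results correlate strongly with my scores on X (0.97). Its output barely varies there (a standard deviation of about 0.00006), so that figure should be read with caution, and its correlation drops to 0.25 on LinkedIn and 0.20 on Medium.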

When comparing across platforms, the estimated proportion of AI-generated content is quite similar among the three according to my own ratings and GLTR, although GPTZero and Originality.ai both score Medium noticeably higher than X. This suggests that AI-assisted writing has become pervasive across mainstream platforms, and perhaps serves as a reminder to question whether what we’re reading truly reflects someone’s original thought, even when it’s posted under their name.

Conclusion

This post is just a small experiment exploring how much of the content I encounter daily might be AI-generated — and how different AI detectors perform on those samples. Personally, I think GPTZero outperforms the others, even though its underlying principles aren’t publicly known.

That said, as generative AI continues to evolve, it will inevitably become harder for humans to distinguish AI-generated content from human-written text. And as AI detectors improve, AI models can, in turn, learn from them and evolve even further.

But is it necessarily a bad thing if one day most of what we read is created by AI and we can no longer tell the difference? Maybe not. If AI eventually reaches or surpasses human-level intelligence, its outputs might become more insightful and valuable than many human-written pieces. At that point, AI-generated content could become something we learn from, rather than something we try to detect and then ignore.

I find that idea a little scary — but also incredibly exciting.

Haoshuai Zhou
