<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://zhs2326.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://zhs2326.github.io/" rel="alternate" type="text/html" /><updated>2026-04-11T09:26:23+00:00</updated><id>https://zhs2326.github.io/feed.xml</id><title type="html">Hi, I’m Haoshuai Zhou!</title><subtitle>personal description</subtitle><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><entry><title type="html">Reflections as a Newbie Tech Lead</title><link href="https://zhs2326.github.io/posts/2026/04/blog-post-1/" rel="alternate" type="text/html" title="Reflections as a Newbie Tech Lead" /><published>2026-04-11T00:00:00+00:00</published><updated>2026-04-11T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2026/04/reflections-as-a-newbie-tech-lead</id><content type="html" xml:base="https://zhs2326.github.io/posts/2026/04/blog-post-1/"><![CDATA[<p><em>I originally wrote this post in Chinese on Xiaohongshu (小红书) in February 2025, where it received a wonderful response with many likes and saves. To reach a wider audience, I’ve translated it into English and shared it here. I hope you find it useful!</em></p>

<hr class="post-intro-divider" />

<p>I started working in July 2021, so I’ve been in industry for three and a half years, and I’ve spent nearly 1.5 of those years in a tech lead role. Over the past six months, I’ve gradually transitioned into a team lead position and learned a lot along the way, both technical and non-technical. Here, from the perspective of a technically-oriented newbie leader, I’d like to share some of my takeaways.</p>

<h2 id="1-keeping-your-technical-skills-sharp-is-not-easy-for-a-new-tech-lead">1. Keeping Your Technical Skills Sharp Is Not Easy for a New Tech Lead</h2>

<p>Transitioning into a tech lead role usually happens because of strong technical ability. But for newcomers — especially younger ones — their technical edge often comes more from grinding hard than from deep accumulated experience. Once you become a tech lead, the demand for technical breadth skyrockets: you need to help team members solve all kinds of problems. The pace of technical growth and problem-solving ability that was previously “good enough” may no longer cut it. At this point, you must consciously push yourself to learn new technologies faster, dive into unfamiliar problems more efficiently, and refine the small habits that impact your effectiveness.</p>

<h2 id="2-be-the-first-to-explore-new-platforms-and-technologies">2. Be the First to Explore New Platforms and Technologies</h2>

<p>In my view, when a new tech lead encounters a new platform or technology, they should always be the first to jump in. On one hand, pioneering new platforms is inherently difficult — this kind of tough nut needs to be cracked by the tech lead. A tech lead should leverage their sharp technical instincts to help the team navigate early pitfalls and identify key issues, so that when it’s handed off to others, only minimal follow-up is needed. On the other hand, if you don’t understand the platform yourself, trying to step in and direct things later becomes very difficult. People may see you as out of your depth and lose confidence in you — especially if you’re a young tech lead.</p>

<h2 id="3-dont-be-afraid-of-conflict">3. Don’t Be Afraid of Conflict</h2>

<p>A tech lead might be able to stay mostly at the technical level and let the work speak for itself (which is already hard enough), but as a team lead, completely avoiding conflict is nearly impossible. In fact, since I became a tech lead, almost everyone on the team has had arguments with me at some point — including my own leader. In my view, a team without disagreements is a team without vitality. If everyone always agrees, what do you even need a whole team for? What matters is how you navigate or even harness conflict to lead people toward doing the right — and often the hard — thing. Behind every argument is a difference of opinion. How to use disagreements to draw out diverse perspectives and ultimately build alignment is something every leader must continuously think about and practice.</p>

<h2 id="4-staying-approachable-is-crucial">4. Staying Approachable Is Crucial</h2>

<p>Until a few months ago, I loved using my so-called “sharp” thinking to immediately point out problems the moment someone shared an idea, and then offer a more efficient path. Gradually, I realized this was making everyone conform to my way of thinking — the team was becoming a clone of me. When people talked to me, their reasoning was carefully curated; they would hide the uncertain parts, the parts most likely to expose mistakes. This is dangerous. Many critical details, potential opportunities, and shifts in team morale become invisible to you. And your relationships suffer too — people don’t even want to chat with you outside of work. So I’ve come to realize that staying deliberately approachable is essential. Don’t immediately shoot down others’ ideas. Don’t fixate on small mistakes. Never make people afraid to express what they truly think in front of you. Let people feel safe to communicate, safe to make mistakes, and free to have their own unique space — that’s what gives a team real energy.</p>

<h2 id="5-personal-character-matters--a-lot">5. Personal Character Matters — A Lot</h2>

<p>I once saw an image that vividly illustrated the difference between a boss and a leader: the boss sits on a carriage, whipping subordinates to pull the cart, while the leader stands at the front, pulling alongside everyone. Frankly, I have no respect for the “boss” type. I believe everyone should aspire to be a leader, not a boss.</p>

<p>Even though I’m the team lead, I’m actually the second youngest person on the team — the oldest member is nearly 20 years my senior. Given my experience, encountering resistance while leading the team is perfectly normal. Fortunately, this pushed me to start practicing <em>influence without authority</em> early on. I believe authority can be a useful tool for a leader, but relying solely on authority to lead people is a recipe for total failure. The moment you leave that position, you’ll quickly learn what it feels like when people treat you differently.</p>

<p>A good leader should consciously avoid leveraging the privileges of their position. Instead, understand people’s needs, find common interests between individuals and the team, and make people <em>want</em> to embrace your ideas and create greater value together. Beyond that, lead by example — teach through competence, inspire through conduct. I believe a leader can be lenient with others but must be strict with themselves. That’s why I almost never skip morning standup. When the team has an important task and teammates are there, I won’t leave either — even if I’m just sitting there, I’ll stay.</p>

<p>Finally, being a leader is just one role at work. In life, you absolutely cannot expect to lead everyone’s thoughts and emotions at all times. If you can only find your sense of purpose at work, you’ll end up losing more than you gain in life. Every person is a vivid, independent individual. Through working together, I’ve come to see the real sides of everyone — their sensitivity, vulnerability, strength, resilience, selfishness, selflessness… and they can see mine too. Truly appreciating the uniqueness of every person around you, and making people <em>want</em> to work with you because of <em>who you are</em> — that’s what matters most.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="leadership" /><category term="career" /><category term="tech lead" /><summary type="html"><![CDATA[From a technically-oriented newbie leader — takeaways on staying sharp, navigating conflict, and leading with character.]]></summary></entry><entry><title type="html">Conformer: Combining CNNs and Transformers for Speech Recognition</title><link href="https://zhs2326.github.io/posts/2026/03/blog-post-1/" rel="alternate" type="text/html" title="Conformer: Combining CNNs and Transformers for Speech Recognition" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2026/03/Conformer</id><content type="html" xml:base="https://zhs2326.github.io/posts/2026/03/blog-post-1/"><![CDATA[<h1 id="conformer">Conformer</h1>

<p>Conformer is a model architecture widely used in automatic speech recognition (ASR). It combines the strengths of CNNs and Transformers by deeply integrating the two structures. To understand it well, we should not only remember its components, but also understand why it is designed that way.</p>

<p><br /></p>

<h1 id="conformer-block-overview">Conformer Block Overview</h1>

<p><img src="/images/posts/conformer/image.png" alt="Conformer Block Overview" /></p>

<p>The core of the Conformer architecture is the conformer block, which essentially has five components: Feed Forward Module (FFN) → Multi-Head Self Attention Module (MHSA) → Convolution Module → Feed Forward Module → Layernorm. This chain of operations can be formulated as:</p>

\[\begin{aligned}
x_{2} &amp;= x_{1} + 0.5*FFN_{1}(x_{1})\\
x_{3} &amp;= x_{2} + MHSA(x_{2})\\
x_{4} &amp;= x_{3} + Conv(x_{3})\\
x_{5} &amp;= x_{4} + 0.5*FFN_{2}(x_{4})\\
x_{6} &amp;= Layernorm(x_{5})
\end{aligned}\]

<p>In one sentence, at first glance: the Conformer block sandwiches the MHSA module (from the Transformer) and the Convolution module, applied in sequence, between two half-weighted FFNs.</p>
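
<p>To make this concrete, here is a minimal PyTorch-style sketch of the block’s forward pass. It is my own illustration rather than code from the paper; the sub-module names are placeholders for the components described below.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, ffn1, mhsa, conv, ffn2, d_model):
        super().__init__()
        # the four sub-modules described above, plus the final LayerNorm
        self.ffn1, self.mhsa, self.conv, self.ffn2 = ffn1, mhsa, conv, ffn2
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)   # half-weighted feed-forward
        x = x + self.mhsa(x)         # multi-head self-attention
        x = x + self.conv(x)         # convolution module
        x = x + 0.5 * self.ffn2(x)   # second half-weighted feed-forward
        return self.norm(x)          # final LayerNorm
</code></pre></div></div>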

<p>In the following, I dive into the components whose design choices I think deserve closer attention, and skip the modules that are largely identical to their vanilla versions.</p>

<p><br /></p>

<h1 id="ffn">FFN</h1>

<p><img src="/images/posts/conformer/image 1.png" alt="FFN Module" /></p>

<p>The precise sequence of the FFN is: <strong>LayerNorm → Linear(d_model, 4·d_model) → Swish → Dropout → Linear(4·d_model, d_model) → Dropout</strong>, with a residual connection around the entire module.</p>

<p>The noteworthy part is that it adopts an inverted-bottleneck structure, which gives the nonlinear activation a <strong>larger representational space</strong> to operate in.</p>
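
<p>A minimal sketch of this module in PyTorch follows, assuming only <code class="language-plaintext highlighter-rouge">d_model</code> and a dropout rate as hyperparameters; the 0.5-scaled residual connection is added by the caller, as in the block sketch above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn as nn

class FeedForwardModule(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),  # inverted bottleneck: expand 4x
            nn.SiLU(),                        # Swish activation
            nn.Dropout(dropout),
            nn.Linear(4 * d_model, d_model),  # project back to d_model
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)  # caller adds the 0.5-scaled residual
</code></pre></div></div>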

<p><br /></p>

<h1 id="convolution-module">Convolution Module</h1>

<p><img src="/images/posts/conformer/image 2.png" alt="Convolution Module" /></p>

<p>The precise sequence of the Convolution Module is: <strong>LayerNorm → Pointwise Conv(d_model, 2·d_model) → GLU → Depthwise Conv(d_model, d_model, kernel=31) → BatchNorm → Swish → Pointwise Conv(d_model, d_model) → Dropout</strong>, with a residual connection around the entire module.</p>

<p>The convolution module employs a depthwise separable convolution, preceded by a pointwise convolution that expands the channel dimension to 2× d_model for GLU activation. GLU splits the expanded tensor into two halves and computes <code class="language-plaintext highlighter-rouge">σ(gate)⊙value</code>, providing learnable channel-wise gating.</p>
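
<p>Below is a hedged sketch of the convolution module in PyTorch. The kernel size of 31 and the 2× channel expansion follow the description above; the transposes are there because <code class="language-plaintext highlighter-rouge">nn.Conv1d</code> expects the channel dimension before time, and the residual is again added by the caller.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn as nn

class ConvolutionModule(nn.Module):
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)  # splits channels in half: sigmoid(gate) * value
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=(kernel_size - 1) // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                 # x: (B, T, d_model)
        x = self.norm(x).transpose(1, 2)  # (B, d_model, T) for Conv1d
        x = self.glu(self.pointwise1(x))  # expand to 2*d_model, gate back to d_model
        x = self.swish(self.bn(self.depthwise(x)))
        x = self.dropout(self.pointwise2(x))
        return x.transpose(1, 2)          # back to (B, T, d_model)
</code></pre></div></div>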

<p><br /></p>

<h1 id="inference-flow">Inference Flow</h1>

<p>Understanding how the data flows through each operation in a model is important for building a concrete understanding, rather than an intuitive but blurry one. In the following, I’ll walk you through how the data shape changes when Conformer is used in real ASR applications.</p>

<hr />

<h3 id="1-raw-audio-input">1. Raw Audio Input</h3>

<p>Everything starts with a raw waveform. In a typical ASR pipeline, you receive audio sampled at <strong>16,000 Hz</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input waveform: (B, T_samples)
  e.g. (4, 64000)  →  4 utterances, each 4 seconds long
</code></pre></div></div>

<hr />

<h3 id="2-feature-extraction--log-mel-spectrogram">2. Feature Extraction — Log-Mel Spectrogram</h3>

<p>The raw waveform is converted to a log-mel spectrogram using a Short-Time Fourier Transform (STFT) with, for example, a 25ms window and 10ms hop.</p>

<ul>
  <li><strong>Frames</strong> = <code class="language-plaintext highlighter-rouge">T_samples / hop_length</code> ≈ <code class="language-plaintext highlighter-rouge">64000 / 160</code> = 400 frames</li>
  <li><strong>Mel bins</strong> = 80 (standard in ESPnet / WeNet setups)</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After feature extraction: (B, T, F)
  e.g. (4, 400, 80)
</code></pre></div></div>
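
<p>As a rough sketch of this step with torchaudio (the exact window, hop, and mel settings vary across toolkits; the values below simply match the numbers used in this walkthrough):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # 25 ms window at 16 kHz
    win_length=400,
    hop_length=160,   # 10 ms hop
    n_mels=80,
)

waveform = torch.randn(4, 64000)            # stand-in for 4 x 4 s of real audio
features = torch.log(mel(waveform) + 1e-6)  # (4, 80, ~400): (B, n_mels, frames)
features = features.transpose(1, 2)         # (4, ~400, 80): (B, T, F)
</code></pre></div></div>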

<hr />

<h3 id="3-specaugment-training-only">3. SpecAugment (Training Only)</h3>

<p>Time and frequency masks are applied to the feature tensor. The <strong>shape does not change</strong> — values are just zeroed out in certain bands.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After SpecAugment: (4, 400, 80)   ← same shape
</code></pre></div></div>
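
<p>A simplified sketch of the masking, with a single frequency mask and a single time mask (real SpecAugment recipes typically apply several masks with randomized widths):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def spec_augment(features, freq_mask=15, time_mask=40):
    """features: (B, T, F); zero out one random frequency band and one time band."""
    B, T, F = features.shape
    out = features.clone()
    f0 = torch.randint(0, F - freq_mask, (1,)).item()
    t0 = torch.randint(0, T - time_mask, (1,)).item()
    out[:, :, f0:f0 + freq_mask] = 0.0   # frequency mask
    out[:, t0:t0 + time_mask, :] = 0.0   # time mask
    return out   # shape unchanged: (B, T, F)
</code></pre></div></div>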

<hr />

<h3 id="4-subsampling-conv2d-subsampler">4. Subsampling (Conv2D Subsampler)</h3>

<p>To reduce the sequence length (which is expensive for attention), a Conv2D subsampling module is applied — typically with stride 2 twice, giving a <strong>4× reduction</strong>.</p>

<p>The feature map is first treated as a 2D image <code class="language-plaintext highlighter-rouge">(T, F)</code>, convolved, then reshaped into a 1D sequence projected to the model dimension <code class="language-plaintext highlighter-rouge">d_model</code> (e.g. 256).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Before subsampling: (4, 400, 80)
Add channel dimension: (4, 1, 400, 80)
Conv2d subsampling: (4, 256, 100, 20)   ← more channels, fewer frequency bins and time frames
Reshape: (4, 256, 100, 20) → (4, 100, 256×20) → (4, 100, 5120)
Linear projection: (4, 100, 256)   ← T//4, d_model
</code></pre></div></div>

<p>This is why ASR Conformers are tractable — attention runs over 100 frames, not 400.</p>
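
<p>A sketch of a subsampler that produces these shapes: two stride-2 convolutions followed by a flatten and a linear projection. The exact channel counts and padding conventions differ across toolkits such as ESPnet and WeNet, so treat this as an illustration of the shape bookkeeping rather than a reference implementation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class Conv2dSubsampling(nn.Module):
    def __init__(self, in_feats=80, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # after two stride-2 convs the 80 mel bins become roughly 80 // 4 = 20
        self.proj = nn.Linear(d_model * (in_feats // 4), d_model)

    def forward(self, x):                           # x: (B, T, F), e.g. (4, 400, 80)
        x = x.unsqueeze(1)                          # (B, 1, T, F)
        x = self.conv(x)                            # (B, 256, T//4, F//4)
        B, C, T, F = x.shape
        x = x.transpose(1, 2).reshape(B, T, C * F)  # (B, T//4, 256*20)
        return self.proj(x)                         # (B, T//4, d_model)
</code></pre></div></div>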

<hr />

<h3 id="5-positional-encoding">5. Positional Encoding</h3>

<p>A sinusoidal (or relative) positional encoding of shape <code class="language-plaintext highlighter-rouge">(1, T', d_model)</code> is <strong>added</strong> to the sequence. Shape is unchanged.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After positional encoding: (4, 100, 256)
</code></pre></div></div>

<hr />

<h3 id="6-conformer-block-n">6. Conformer Block (×N)</h3>

<p>Each Conformer block is composed of four sub-modules in sequence. Let’s trace shape through <strong>one block</strong>:</p>

<h3 id="6a-feed-forward-module-first-half-scale-">6a. Feed-Forward Module (first half, scale ½)</h3>

<p>A two-layer FFN with expansion factor 4:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input          : (4, 100, 256)
After Linear_1 : (4, 100, 1024)   ← expand
After Swish    : (4, 100, 1024)
After Dropout  : (4, 100, 1024)
After Linear_2 : (4, 100, 256)    ← project back
</code></pre></div></div>

<h3 id="6b-multi-head-self-attention-module">6b. Multi-Head Self-Attention Module</h3>

<p>With <code class="language-plaintext highlighter-rouge">num_heads = 4</code> and <code class="language-plaintext highlighter-rouge">d_model = 256</code>, each head has <code class="language-plaintext highlighter-rouge">d_k = 64</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input          : (4, 100, 256)
Q, K, V        : each (4, 4, 100, 64)   ← (B, heads, T, d_k)
Attention scores: (4, 4, 100, 100)
After softmax  : (4, 4, 100, 100)
Context        : (4, 4, 100, 64)
After reshape  : (4, 100, 256)
After out proj : (4, 100, 256)
</code></pre></div></div>

<h3 id="6c-convolution-module">6c. Convolution Module</h3>

<p>A depthwise convolution with kernel size 31 operates along the time axis:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input               : (4, 100, 256)
After pointwise_1   : (4, 100, 512)   ← pointwise conv doubles channels for GLU
After GLU           : (4, 100, 256)   ← halves back
After depthwise conv: (4, 100, 256)   ← kernel=31, same padding
After BatchNorm     : (4, 100, 256)
After Swish         : (4, 100, 256)
After pointwise_2   : (4, 100, 256)
</code></pre></div></div>

<h3 id="6d-feed-forward-module-second-half-scale-">6d. Feed-Forward Module (second half, scale ½)</h3>

<p>Same as 6a. Output shape stays <code class="language-plaintext highlighter-rouge">(4, 100, 256)</code>.</p>

<p>After all N=12 Conformer blocks (typical for medium-size models), the shape is still:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After N Conformer blocks: (4, 100, 256)
</code></pre></div></div>

<hr />

<h3 id="7-ctc--attention-decoder-head">7. CTC / Attention Decoder Head</h3>

<p>Depending on the decoding strategy:</p>

<p><strong>CTC Head</strong> — a linear projection over the vocabulary (e.g. 5000 BPE tokens):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After Linear  : (4, 100, 5000)
After LogSoftmax: (4, 100, 5000)   ← per-frame token log-probs
</code></pre></div></div>
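
<p>As a sketch of the CTC head and its loss call (note that <code class="language-plaintext highlighter-rouge">nn.CTCLoss</code> expects log-probabilities with time first, hence the transpose; the vocabulary size of 5000 and target length of 20 are just running examples):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

vocab_size, d_model = 5000, 256
ctc_head = nn.Linear(d_model, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

encoder_out = torch.randn(4, 100, d_model)           # (B, T', d_model)
log_probs = ctc_head(encoder_out).log_softmax(-1)    # (4, 100, 5000)

targets = torch.randint(1, vocab_size, (4, 20))      # dummy token ids (0 is blank)
input_lengths = torch.full((4,), 100)
target_lengths = torch.full((4,), 20)
loss = ctc_loss(log_probs.transpose(0, 1),           # (T', B, vocab) as CTCLoss expects
                targets, input_lengths, target_lengths)
</code></pre></div></div>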

<p><strong>Attention Decoder</strong> — an autoregressive Transformer decoder cross-attending to the encoder output, producing one token at a time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Encoder output : (4, 100, 256)
Decoder input  : (4, L_text, 256)   ← L_text = target length
Cross-attention: keys/values from encoder, queries from decoder
Final output   : (4, L_text, 5000)
</code></pre></div></div>

<hr />

<h3 id="summary-table">Summary Table</h3>

<table>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Shape</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Raw waveform</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T_samples)</code></td>
    </tr>
    <tr>
      <td>Log-Mel features</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T, 80)</code></td>
    </tr>
    <tr>
      <td>After subsampling</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T/4, 256)</code></td>
    </tr>
    <tr>
      <td>After each Conformer block</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T/4, 256)</code></td>
    </tr>
    <tr>
      <td>CTC output</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T/4, vocab_size)</code></td>
    </tr>
    <tr>
      <td>Decoder output</td>
      <td><code class="language-plaintext highlighter-rouge">(B, L_text, vocab_size)</code></td>
    </tr>
  </tbody>
</table>

<hr />

<p>The key insight is that the <strong>sequence length shrinks early</strong> (at the subsampler) and then <strong>stays constant</strong> all the way through the Conformer stack — this is what makes the self-attention computationally feasible. The model dimension <code class="language-plaintext highlighter-rouge">d_model</code> is similarly fixed throughout, acting as a consistent “information highway” between modules.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="speech recognition" /><category term="deep learning" /><category term="transformer" /><summary type="html"><![CDATA[A deep dive into the Conformer architecture, which fuses CNNs and Transformers for state-of-the-art automatic speech recognition.]]></summary></entry><entry><title type="html">I Tested AI Detectors on My Daily Feeds — Here’s What I Found</title><link href="https://zhs2326.github.io/posts/2025/11/blog-post-1/" rel="alternate" type="text/html" title="I Tested AI Detectors on My Daily Feeds — Here’s What I Found" /><published>2025-11-09T00:00:00+00:00</published><updated>2025-11-09T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2025/11/I%20Tested%20AI%20Detectors%20on%20My%20Daily%20Feeds%20%E2%80%94%20Here%E2%80%99s%20What%20I%20Found</id><content type="html" xml:base="https://zhs2326.github.io/posts/2025/11/blog-post-1/"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>With AI-generated content becoming increasingly common these days (even this blog post has been processed by AI), I’ve started wondering how much of what I read every day is actually produced by AI — and whether I still have the ability or tools to tell AI-generated content apart from human-written ones.</p>

<p>To explore this, I ran a small experiment. I chose three mainstream content platforms: <strong>X</strong>, <strong>LinkedIn</strong>, and <strong>Medium</strong>. Over the course of five days, I randomly selected three posts I encountered on each platform each day, then rated how likely I thought each piece was AI-generated.</p>

<p>In addition, I used five popular AI-text detectors — <strong>GPTZero</strong>, <strong>Originality.AI</strong>, <strong>Quillbot</strong>, <strong>GPT2 Output Detector</strong>, and <strong>GLTR</strong> — to evaluate the same content. While I didn’t have the ground truth (since I didn’t ask the authors whether they used AI), it was still fascinating to see how much of my daily input might be influenced by AI. It also made me wonder whether different platforms have different proportions of AI-generated content — and if so, whether we should start seeking out “high human-to-AI ratio” platforms when authenticity really matters.</p>

<p>Before diving into the results, I’ll first give a brief overview of how the public AI-text detectors that I use (GPT2 Output Detector and GLTR) actually work and how I gave my score.</p>

<h1 id="gpt2-output-detector">GPT2 Output Detector</h1>

<p>The GPT-2 Output Detector is a RoBERTa-based model designed for sequence classification. It is trained to distinguish between real human-written text and text generated by GPT-2. The model outputs a probability indicating the likelihood that a given input sequence was produced by a human rather than by the GPT-2 language model.</p>

<p>For further details, you can refer to the following resources: <a href="https://openai.com/index/gpt-2-1-5b-release/">OpenAI’s GPT-2 1.5B release</a>, the <a href="https://arxiv.org/pdf/1908.09203">research paper</a>, and the <a href="https://github.com/openai/gpt-2-output-dataset/tree/master/detector">GPT-2 Output Dataset repository</a>.</p>

<h1 id="gltr">GLTR</h1>

<p>GLTR (Giant Language Model Test Room) leverages language models, like GPT-2, to detect AI-generated text. It is based on the idea that language generation models tend to select the most statistically likely words, while human-written text often exhibits more variation. If a paragraph contains words that are consistently high-probability choices for a language model, it is more likely to be AI-generated. GLTR visualizes the probability of each word being part of a GPT-2 output, helping to distinguish between human-written and AI-generated text.</p>

<p>For more information on GLTR, visit <a href="http://gltr.io/">GLTR.io</a> or check out the <a href="https://arxiv.org/pdf/1906.04043">research paper</a>.</p>

<p>In the following analysis, I used the <strong>ratio of “green” words to total words</strong> in GLTR as its estimate of AI-generation probability. For the other AI detectors, I simply used the scores or probabilities they provided to represent how likely the text was generated by AI.</p>
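
<p>To make the “green ratio” concrete, here is a rough sketch of how such a score can be computed with GPT-2 via Hugging Face <code class="language-plaintext highlighter-rouge">transformers</code>. This is my own approximation of GLTR’s top-10 “green” bucket, not GLTR’s actual code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def green_ratio(text, top_k=10):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                 # (1, seq_len, vocab)
    next_ids = ids[0, 1:]                          # the tokens that were actually written
    preds = logits[0, :-1]                         # model predictions for those positions
    # rank of each actual token under the model's next-token distribution
    sorted_ids = preds.argsort(dim=-1, descending=True)
    ranks = (sorted_ids == next_ids.unsqueeze(-1)).float().argmax(-1)
    return ranks.lt(top_k).float().mean().item()   # fraction of tokens in the top-10
</code></pre></div></div>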

<h1 id="human-evaluation">Human Evaluation</h1>

<p>All the human scores in this analysis were rated by me. For scoring, I adopted the <strong>Mean Opinion Score (MOS)</strong> approach: if I believed a piece was definitely written by AI, I gave it a <strong>5</strong>; if I thought it was definitely human-written, I gave it a <strong>1</strong>. These scores were later normalized to a 0–1 range for analysis.</p>

<p>(One thing worth noting is that I’m not a native English speaker, which might slightly influence my judgment when evaluating whether something “feels” AI-generated.)</p>

<p>Now, let’s move on to the results. First, we’ll look at the <strong>percentage of AI-generated content</strong> on each platform — as evaluated by each detector and by myself.</p>

<h1 id="experimental-results">Experimental Results</h1>

<p><img src="/images/posts/I-Tested-AI-Detectors-on-My-Daily-Feeds-Heres-What-I-Found/image.png" alt="image.png" /></p>

<p><img src="/images/posts/I-Tested-AI-Detectors-on-My-Daily-Feeds-Heres-What-I-Found/image 1.png" alt="image.png" /></p>

<p><img src="/images/posts/I-Tested-AI-Detectors-on-My-Daily-Feeds-Heres-What-I-Found/image 2.png" alt="image.png" /></p>

<p>We can see that <strong>Quillbot</strong> and the <strong>GPT-2 Output Detector</strong> (especially the latter) show almost no ability to distinguish between AI-generated and human-written content. Their outputs are highly concentrated, consistently producing AI probabilities close to zero. This isn’t too surprising for the GPT-2 Output Detector — after all, as of <strong>November 2025</strong>, very few AI-generated texts are actually produced by GPT-2. However, it’s somewhat surprising that <strong>Quillbot</strong>, a commercial product, performs in a similar way.</p>

<p>On the other hand, <strong>GPTZero</strong> displays a noticeably wider score distribution. While that doesn’t necessarily mean it’s more accurate (since I don’t have the ground truth), it does suggest that GPTZero attempts a more fine-grained evaluation of how “AI-like” a text might be compared to the other detectors.</p>

<p>Next, let’s look at some statistics and put the results of all three platforms together.</p>

<table>
  <thead>
    <tr>
      <th>Detector</th>
      <th>Twitter</th>
      <th>LinkedIn</th>
      <th>Medium</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPTZero</td>
      <td>0.218667 / 0.403111 / 0.380635</td>
      <td>0.3 / 0.425357 / 0.511976</td>
      <td>0.533333 / 0.41579 / 0.904627</td>
    </tr>
    <tr>
      <td>Originality.ai</td>
      <td>0.066667 / 0.255445 / 0.149693</td>
      <td>0.192 / 0.391174 / -0.461454</td>
      <td>0.294667 / 0.432119 / 0.697721</td>
    </tr>
    <tr>
      <td>Quillbot</td>
      <td>0.0 / 0.0 / 0.0</td>
      <td>0.088 / 0.256549 / 0.299827</td>
      <td>0.064 / 0.186616 / 0.206284</td>
    </tr>
    <tr>
      <td>GPT2Output</td>
      <td>0.000233 / 6.2e-05 / 0.968864</td>
      <td>0.03834 / 0.147356 / 0.250164</td>
      <td>0.00318 / 0.011459 / 0.203963</td>
    </tr>
    <tr>
      <td>GLTR</td>
      <td>0.674333 / 0.066996 / -0.229616</td>
      <td>0.641067 / 0.066023 / -0.437913</td>
      <td>0.6584 / 0.033645 / 0.341786</td>
    </tr>
    <tr>
      <td>Human</td>
      <td>0.48 / 0.248424 / 1.0</td>
      <td>0.52 / 0.211119 / 1.0</td>
      <td>0.413333 / 0.24456 / 1.0</td>
    </tr>
  </tbody>
</table>

<p>Each cell in the table above shows the <strong>mean</strong>, <strong>standard deviation</strong>, and <strong>correlation with human scores</strong> (my ratings) for each detector.</p>

<p>From these results, we can see that <strong>GPTZero</strong> shows a noticeably strong correlation with my human ratings. Interestingly, even though the <strong>GPT-2 Output Detector</strong> almost always reports very low AI probabilities, its results still correlate fairly well with my scores.</p>

<p>When comparing across platforms, the overall proportion of AI-generated content appears quite similar among the three. This suggests that AI-assisted writing has become pervasive across mainstream platforms — and perhaps serves as a reminder to question whether what we’re reading truly reflects someone’s original thought, even when it’s posted under their name.</p>

<h1 id="conclusion">Conclusion</h1>

<p>This post is just a small experiment exploring how much of the content I encounter daily might be AI-generated — and how different AI detectors perform on those samples. Personally, I think <strong>GPTZero</strong> outperforms the others, even though its underlying principles aren’t publicly known.</p>

<p>That said, as generative AI continues to evolve, it will inevitably become harder for humans to distinguish AI-generated content from human-written text. And as AI detectors improve, AI models can, in turn, learn from them and evolve even further.</p>

<p>But is it necessarily a bad thing if one day most of what we read is created by AI — and we can no longer tell the difference? Maybe not. If AI eventually reaches or surpasses human-level intelligence, its outputs might actually become more insightful and valuable than many human-written pieces. At that point, AI-generated content could even become something we learn from, rather than something we try to detect and then ignore.</p>

<p>I find that idea a little scary — but also incredibly exciting.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="ai detector" /></entry><entry><title type="html">Paper Reading: Progressive Neural Networks</title><link href="https://zhs2326.github.io/posts/2025/07/blog-post-2/" rel="alternate" type="text/html" title="Paper Reading: Progressive Neural Networks" /><published>2025-07-07T00:00:00+00:00</published><updated>2025-07-07T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2025/07/Progressive%20Neural%20Networks</id><content type="html" xml:base="https://zhs2326.github.io/posts/2025/07/blog-post-2/"><![CDATA[<h1 id="progressive-neural-networks">Progressive Neural Networks</h1>

<p><strong>Citation Count:</strong> 3443<br />
<strong>Organization:</strong> DeepMind<br />
<strong>Year:</strong> 2016</p>

<h2 id="background">Background</h2>

<p>The paper <em>Progressive Neural Networks</em> is one of the earliest and most influential works in the field of continual learning. Although its core idea is simpler compared to other foundational papers—such as <em>Overcoming Catastrophic Forgetting in Neural Networks</em> (EWC)—it represents a different category of approaches. Instead of using regularization to prevent forgetting, it tackles the problem by <strong>expanding the network</strong> and <strong>adding new parameters</strong> when learning new tasks. Let’s take a closer look at how it works!</p>

<h2 id="method">Method</h2>

<p>If you’re familiar with regularization-based methods, you’ll recall that their key idea is to constrain the model—typically through some form of penalty—so it doesn’t forget previously learned tasks when adapting to new ones. A hidden assumption in that approach is the desire to <strong>keep model complexity fixed</strong>.</p>

<p>But what if model size isn’t a hard constraint? If we allow the network to grow, we can simply <strong>add new capacity</strong> for new tasks and preserve the existing knowledge as-is. This is the central idea behind <em>Progressive Neural Networks</em>.</p>

<p>Each <strong>column</strong> in the architecture below represents a sub-network dedicated to a specific task. Initially, only the first column exists, handling the first task. Whenever a new task arrives, a new column is added to learn that task.</p>

<p><img src="/images/posts/progressive-neural-networks/image.png" alt="Architecture diagram of Progressive Neural Networks" /></p>

<p>Here are a few key points worth noting:</p>

<ol>
  <li><strong>Knowledge reuse:</strong> Unlike training a completely separate model for each new task, <em>Progressive Neural Networks</em> allow for <strong>knowledge transfer</strong> via <strong>lateral connections</strong>. These connections link each layer in the new column to the previous layers in trained columns, enabling the model to reuse useful representations.</li>
  <li><strong>No forgetting:</strong> When learning new tasks, the parameters of previously trained columns are <strong>frozen</strong>. This is the fundamental reason why catastrophic forgetting does not occur.</li>
</ol>
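
<p>To make the column-and-lateral-connection idea concrete, here is a heavily simplified two-column sketch in PyTorch. Real progressive networks use multi-layer columns and adapter modules with non-linearities on the lateral connections; this toy version only shows where the freezing and the lateral path sit.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class TwoColumnProgressiveNet(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        # column 1: trained on task 1, then frozen
        self.c1_l1 = nn.Linear(d_in, d_hidden)
        self.c1_l2 = nn.Linear(d_hidden, d_out)
        # column 2: added for task 2
        self.c2_l1 = nn.Linear(d_in, d_hidden)
        self.c2_l2 = nn.Linear(d_hidden, d_out)
        # lateral connection from column 1's hidden layer into column 2's output layer
        self.lateral = nn.Linear(d_hidden, d_out)

    def freeze_column1(self):
        for p in [*self.c1_l1.parameters(), *self.c1_l2.parameters()]:
            p.requires_grad = False   # no forgetting: the old column never changes

    def forward_task2(self, x):
        h1 = torch.relu(self.c1_l1(x))             # frozen features from column 1
        h2 = torch.relu(self.c2_l1(x))             # new features from column 2
        return self.c2_l2(h2) + self.lateral(h1)   # reuse old knowledge laterally
</code></pre></div></div>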

<h2 id="conclusion">Conclusion</h2>

<p><em>Progressive Neural Networks</em> is a relatively easy-to-understand paper with clear intuition and straightforward implementation. Still, it’s important to appreciate why it belongs in the continual learning family and how its design philosophy differs from regularization-based approaches. While it trades off parameter efficiency for simplicity and robustness to forgetting, it remains a foundational method in the field—and an important one to understand.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="continual learning" /></entry><entry><title type="html">Paper Reading: Overcoming Catastrophic Forgetting in Neural Network</title><link href="https://zhs2326.github.io/posts/2025/06/blog-post-2/" rel="alternate" type="text/html" title="Paper Reading: Overcoming Catastrophic Forgetting in Neural Network" /><published>2025-06-29T00:00:00+00:00</published><updated>2025-06-29T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2025/06/Overcoming%20Catastrophic%20Forgetting%20in%20Neural%20Network</id><content type="html" xml:base="https://zhs2326.github.io/posts/2025/06/blog-post-2/"><![CDATA[<h2 id="background">Background</h2>

<p>One way to address catastrophic forgetting is to introduce regularization when learning new tasks. The idea is straightforward: continue learning the new task while applying regularization to prevent the model from forgetting what it learned from previous tasks. The tricky part is figuring out <em>how</em> to apply this regularization effectively. The paper <em>“Overcoming Catastrophic Forgetting in Neural Networks”</em> proposes a method called <strong>Elastic Weight Consolidation (EWC)</strong>, which offers one such solution.</p>

<h2 id="fisher-information">Fisher Information</h2>

<p>From my point of view, the core idea of <strong>Elastic Weight Consolidation (EWC)</strong> is grounded in <strong>Fisher Information</strong>. If you develop a solid understanding of Fisher Information, you’ve essentially grasped 90% of the paper. However, the concept can be quite abstract, especially for those without a strong background in statistics. Personally, despite having a basic understanding of probability, it took me over 10 hours to truly grasp the foundational idea of Fisher Information.</p>

<p>In the following, I’ll walk you through Fisher Information from a beginner’s point of view. Some terminology may be simplified and not strictly rigorous, but the explanation should be sufficient to help you understand the core intuition behind the paper.</p>

<h3 id="problem-to-solve">Problem to Solve</h3>

<p>Let’s begin with the problem context. Imagine we have a set of data points that are assumed to come from a known family of probability distributions. Our goal is to estimate some unknown parameters of this distribution using the observed data.</p>

<p>For example, suppose the data points are drawn from a Gaussian distribution, but we don’t know the mean or variance. How precise can our estimate of the parameters be, given the data? <strong>Fisher Information</strong> provides a way to quantify that.</p>

<p>But before we define Fisher Information, let’s revisit a fundamental concept: <strong>likelihood</strong>.</p>

<h3 id="likelihood">Likelihood</h3>

<p>Most of us are familiar with <strong>probability</strong>. Given a distribution parameterized by some variable $\theta$, the probability tells us how likely a specific data point $x$ is under that distribution. If the data point is continuous, we can represent its probability using a probability density function, denoted as $p_{\theta}(x)$. In this case, $x$ is the variable, and $\theta$ is fixed.</p>

<p>In contrast, <strong>likelihood</strong> flips the perspective. Given an observed data point $x$, likelihood asks: <em>How does the probability of observing this data change as the parameter</em> $\theta$ <em>varies?</em> There are two things noteworthy here:</p>

<ul>
  <li>Likelihood is a <strong>function of the parameter</strong> $\theta$, with the data $x$ fixed.</li>
  <li>It is usually <strong>proportional</strong> to the probability density evaluated at $x$ for a given $\theta$.</li>
</ul>

<p>Mathematically, we write the likelihood as: $L(\theta\vert x) = p_{\theta}(x)$. Even though it looks the same as the PDF, the interpretation is different: here, we treat $x$ as fixed and view $L$ as a function of $\theta$. With a basic understanding of likelihood in place, let’s now return to the concept of <strong>Fisher Information</strong>.</p>

<h3 id="now-to-fisher-information">Now to Fisher Information</h3>

<p>Mathematically, <strong>Fisher Information</strong> is denoted as:</p>

\[I_{X}(\theta) = \int_x p(x|\theta)\left(\frac{d}{d\theta}\log p(x|\theta)\right)^2 dx\]

<p>This formula can be broken down into two main parts:</p>

<ol>
  <li>The integrand: $\left(\frac{d}{d\theta}\log p(x \vert \theta)\right)^2$, which is the square of the derivative of the log-likelihood with respect to the parameter $\theta$.</li>
  <li>The integration: Taking the expectation over all possible values of $x$, weighted by their likelihood under the current parameter $\theta$.</li>
</ol>

<p>The integration is straightforward, while the integrand is more involved. Let’s unpack the intuition behind the integrand step by step.</p>

<ul>
  <li>First, note that the <strong>log</strong> is used for mathematical convenience—it turns products into sums and simplifies derivatives. Since it’s a monotonic function, it doesn’t fundamentally change where the likelihood peaks or how sensitive it is to $\theta$. So for intuition, we can temporarily set aside the log and focus on the likelihood itself.</li>
  <li>Next, the <strong>derivative</strong> of the likelihood (or log-likelihood) with respect to $\theta$ tells us <strong>how sensitive</strong> the probability of observing data point $x$ is to changes in the parameter. In other words, it measures how much the likelihood shifts when $\theta$ is nudged.</li>
  <li>Then, the <strong>square</strong> of the derivative simply removes the sign and emphasizes the magnitude—ensuring that both increases and decreases in likelihood contribute positively to the overall measure.</li>
</ul>

<p>Putting this together:</p>

<p>If the derivative is close to <strong>zero</strong>, it means that small changes in $\theta$ don’t affect the likelihood much. This suggests that $x$ doesn’t contain much information about $\theta$ at that value, because nearby values of $\theta$ are almost equally likely to have generated the same $x$, and thus we can’t estimate $\theta$ very precisely based on $x$—the <strong>Fisher Information is low</strong>. Conversely, if the derivative is <strong>large</strong>, the likelihood is very sensitive to $\theta$, meaning $x$ can pin down $\theta$ well—the <strong>Fisher Information is high</strong>.</p>

<p>Now, I think you’ve got a basic understanding of why Fisher Information captures how sensitive the likelihood of the data $x$ is to the parameter $\theta$.</p>

<h2 id="back-to-ewc">Back to EWC</h2>

<p><img src="/images/posts/overcoming-catastrophic-forgetting-in-neural-networks/image.png" alt="image.png" /></p>

<p>Let’s return to EWC. At its core, <strong>Elastic Weight Consolidation (EWC)</strong> is based on the above loss function, which consists of two components. The first is the standard loss for task B (the new task), and the second is a regularization term. In this term, $F$ denotes the <strong>Fisher Information Matrix</strong>, $\theta$ represents the current weights being optimized, and $\theta_A^*$ refers to the weights previously learned for task A (the old task). $\lambda$ is a hyperparameter that controls the strength of the regularization, and $i$ indexes the model parameters.</p>

<p>In other words, weights associated with higher Fisher Information—meaning small changes in them would significantly affect the likelihood—are considered more critical for preserving performance on previous tasks. EWC penalizes changes to these important weights more heavily, thereby preventing the model from “forgetting” them when learning new tasks. This is how Fisher Information is used to guide the regularization in EWC and mitigate catastrophic forgetting.</p>
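
<p>As a rough sketch (my own simplification, not DeepMind’s implementation), the diagonal Fisher estimate and the EWC penalty can be written as follows; the Fisher diagonal is approximated by averaging squared gradients over task A’s data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def diag_fisher(model, data_loader, loss_fn):
    """Approximate the diagonal of the Fisher matrix with squared gradients on task A."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    return {n: f / len(data_loader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_A_i)^2, as in the EWC loss."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty
</code></pre></div></div>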

<p>The paper also provides a theoretical justification for the form of the loss function, even though the intuition is quite clear. For a more formal understanding, you can refer to Equations (1) and (2) in the paper, which are relatively straightforward to derive.</p>

<p>In the experimental section, the authors evaluate EWC on both a classic benchmark—Permuted MNIST—and reinforcement learning tasks. The results show that EWC outperforms vanilla SGD and L2 regularization in terms of maintaining performance across tasks. We won’t go into the experimental details here.</p>

<h2 id="conclusion">Conclusion</h2>

<p><em>“Overcoming Catastrophic Forgetting in Neural Networks”</em> is a classic paper in the field of continual learning. It presents a clear narrative, combining intuitive ideas with solid mathematical foundations, and supports them with well-designed experiments that demonstrate the method’s effectiveness. While fully grasping every detail may require a strong background in statistics or some additional reading, the paper is absolutely worth revisiting multiple times to understand the reasoning behind its approach.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="continual learning" /></entry><entry><title type="html">Paper Reading: Learning without Forgetting</title><link href="https://zhs2326.github.io/posts/2025/06/blog-post-1/" rel="alternate" type="text/html" title="Paper Reading: Learning without Forgetting" /><published>2025-06-01T00:00:00+00:00</published><updated>2025-06-01T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2025/06/learning-without-forgetting</id><content type="html" xml:base="https://zhs2326.github.io/posts/2025/06/blog-post-1/"><![CDATA[<h1 id="learning-without-forgetting">Learning without Forgetting</h1>

<p><strong>Citation Count:</strong> 5576<br />
<strong>Main Idea:</strong> Use the old model’s predictions on new data to regularize the network against forgetting old tasks when learning new tasks (under the constraint of having no access to previous data)<br />
<strong>Organization:</strong> UIUC<br />
<strong>Year:</strong> 2016</p>

<h2 id="paper-link">Paper Link</h2>

<p><a href="https://arxiv.org/pdf/1606.09282">https://arxiv.org/pdf/1606.09282</a></p>

<h2 id="background">Background</h2>

<p>There are three common approaches to learning new classification tasks while maintaining performance on previously learned ones, each with distinct advantages and disadvantages:</p>

<ol>
  <li>Feature extraction: The original network’s feature extraction and classifier components remain fixed, while a new classifier is added to the feature extractor’s output to learn new tasks.
    <ul>
      <li>Best performance on original tasks</li>
      <li>Suboptimal performance on new tasks since features are optimized for old tasks</li>
    </ul>
  </li>
  <li>Fine Tuning: Both the original network’s feature extractor and the new classifier are tuned for the new task.
    <ul>
      <li>Good performance on new tasks</li>
      <li>Degraded performance on old tasks as features become optimized for new tasks</li>
    </ul>
  </li>
  <li>Joint Training: Data from both old and new tasks are combined for training, with the feature extractor and all classifiers being tuned simultaneously.
    <ul>
      <li>Good performance on both old and new tasks</li>
      <li>Requires access to old task data</li>
      <li>Longer training time due to increased data volume and full network retraining</li>
    </ul>
  </li>
</ol>

<p>To combine the advantages and eliminate the disadvantages of these methods, the authors proposed <strong>Learning without Forgetting (LWF)</strong>, which offers:</p>

<ol>
  <li>Good performance on both new and old tasks</li>
  <li>No need for old task data</li>
  <li>Training time comparable to fine-tuning when learning new tasks</li>
</ol>

<h2 id="main-idea">Main Idea</h2>

<p>While the original paper contains many details, the core concept is straightforward:</p>

<p><strong>Joint training achieves good performance on both old and new tasks but requires old data. To overcome this limitation, we can use an alternative approach: have the original network classify new data for old tasks, recording these outputs as a representation of the original network. Then, while learning new tasks, train the network to maintain these recorded outputs (feature extractor + old classifier) while simultaneously learning the new tasks (feature extractor + new classifier).</strong></p>

<p>The core idea can be expressed mathematically as follows:</p>

<p><img src="/images/posts/learning-without-forgetting/image.png" alt="LWF Mathematical Formulation" /></p>

<p>where $s$ represents the feature extractor, $o$ relates to the old classifier, $n$ relates to the new classifier, and $R$ represents regularization terms.</p>
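
<p>As a rough sketch of the core loss in PyTorch (my own simplification): cross-entropy on the new task plus a distillation term that keeps the old-task head close to the recorded outputs. The temperature of 2 follows the paper; the KL form below differs from the paper’s distillation cross-entropy only by a constant with respect to the trainable parameters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

def lwf_loss(new_logits, new_labels, old_logits, recorded_old_logits, T=2.0, lam=1.0):
    # standard loss for the new task
    loss_new = F.cross_entropy(new_logits, new_labels)
    # keep the old-task head close to the outputs recorded before training started
    p_old = F.log_softmax(old_logits / T, dim=-1)
    q_old = F.softmax(recorded_old_logits / T, dim=-1)
    loss_old = F.kl_div(p_old, q_old, reduction="batchmean") * (T * T)
    return loss_new + lam * loss_old
</code></pre></div></div>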

<h2 id="experimental-results">Experimental Results</h2>

<p>The experimental results demonstrate that LWF achieves a good balance between old and new tasks, as illustrated below:</p>

<p><img src="/images/posts/learning-without-forgetting/image 1.png" alt="LWF Experimental Results" /></p>

<p>While most ablation studies yield minor insights, one notable finding is that task dissimilarity affects performance: when new tasks differ significantly from old ones, performance on old tasks deteriorates across almost all methods. This phenomenon has also been noted in other papers.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Learning without Forgetting stands as a seminal paper in continual learning. Though its method and performance may seem modest by today’s standards, it remains widely cited in subsequent research. Understanding this work provides crucial insights into basic approaches for preventing catastrophic forgetting. When studied alongside other key papers in continual learning, it helps illuminate the field’s evolution and underlying principles.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="continual learning" /></entry></feed>