<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://zhs2326.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://zhs2326.github.io/" rel="alternate" type="text/html" /><updated>2026-04-11T09:26:23+00:00</updated><id>https://zhs2326.github.io/feed.xml</id><title type="html">Hi, I’m Haoshuai Zhou!</title><subtitle>personal description</subtitle><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><entry><title type="html">Reflections as a Newbie Tech Lead</title><link href="https://zhs2326.github.io/posts/2026/04/blog-post-1/" rel="alternate" type="text/html" title="Reflections as a Newbie Tech Lead" /><published>2026-04-11T00:00:00+00:00</published><updated>2026-04-11T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2026/04/reflections-as-a-newbie-tech-lead</id><content type="html" xml:base="https://zhs2326.github.io/posts/2026/04/blog-post-1/"><![CDATA[<p><em>I originally wrote this post in Chinese on Xiaohongshu (小红书) in February 2025, where it received a wonderful response with many likes and saves. To reach a wider audience, I’ve translated it into English and shared it here. I hope you find it useful!</em></p>

<hr class="post-intro-divider" />

<p>I started working in July 2021, so I’ve been in industry for three and a half years, and I’ve spent nearly 1.5 of those years in a tech lead role. Over the past six months, I’ve gradually transitioned into a team lead position and learned a lot along the way, both technical and non-technical. Here, from the perspective of a technically-oriented newbie leader, I’d like to share some of my takeaways.</p>

<h2 id="1-keeping-your-technical-skills-sharp-is-not-easy-for-a-new-tech-lead">1. Keeping Your Technical Skills Sharp Is Not Easy for a New Tech Lead</h2>

<p>Transitioning into a tech lead role usually happens because of strong technical ability. But for newcomers — especially younger ones — their technical edge often comes more from grinding hard than from deep accumulated experience. Once you become a tech lead, the demand for technical breadth skyrockets: you need to help team members solve all kinds of problems. The pace of technical growth and problem-solving ability that was previously “good enough” may no longer cut it. At this point, you must consciously push yourself to learn new technologies faster, dive into unfamiliar problems more efficiently, and refine the small habits that impact your effectiveness.</p>

<h2 id="2-be-the-first-to-explore-new-platforms-and-technologies">2. Be the First to Explore New Platforms and Technologies</h2>

<p>In my view, when a new tech lead encounters a new platform or technology, they should always be the first to jump in. On one hand, pioneering new platforms is inherently difficult — this kind of tough nut needs to be cracked by the tech lead. A tech lead should leverage their sharp technical instincts to help the team navigate early pitfalls and identify key issues, so that when it’s handed off to others, only minimal follow-up is needed. On the other hand, if you don’t understand the platform yourself, trying to step in and direct things later becomes very difficult. People may see you as out of your depth and lose confidence in you — especially if you’re a young tech lead.</p>

<h2 id="3-dont-be-afraid-of-conflict">3. Don’t Be Afraid of Conflict</h2>

<p>A tech lead might be able to stay mostly at the technical level and let the work speak for itself (which is already hard enough), but as a team lead, completely avoiding conflict is nearly impossible. In fact, since I became a tech lead, almost everyone on the team has had arguments with me at some point — including my own leader. In my view, a team without disagreements is a team without vitality. If everyone always agrees, what do you even need a whole team for? What matters is how you navigate or even harness conflict to lead people toward doing the right — and often the hard — thing. Behind every argument is a difference of opinion. How to use disagreements to draw out diverse perspectives and ultimately build alignment is something every leader must continuously think about and practice.</p>

<h2 id="4-staying-approachable-is-crucial">4. Staying Approachable Is Crucial</h2>

<p>Until a few months ago, I loved using my so-called “sharp” thinking to immediately point out problems the moment someone shared an idea, and then offer a more efficient path. Gradually, I realized this was making everyone conform to my way of thinking — the team was becoming a clone of me. When people talked to me, their reasoning was carefully curated; they would hide the uncertain parts, the parts most likely to expose mistakes. This is dangerous. Many critical details, potential opportunities, and shifts in team morale become invisible to you. And your relationships suffer too — people don’t even want to chat with you outside of work. So I’ve come to realize that staying deliberately approachable is essential. Don’t immediately shoot down others’ ideas. Don’t fixate on small mistakes. Never make people afraid to express what they truly think in front of you. Let people feel safe to communicate, safe to make mistakes, and free to have their own unique space — that’s what gives a team real energy.</p>

<h2 id="5-personal-character-matters--a-lot">5. Personal Character Matters — A Lot</h2>

<p>I once saw an image that vividly illustrated the difference between a boss and a leader: the boss sits on a carriage, whipping subordinates to pull the cart, while the leader stands at the front, pulling alongside everyone. Frankly, I have no respect for the “boss” type. I believe everyone should aspire to be a leader, not a boss.</p>

<p>Even though I’m the team lead, I’m actually the second youngest person on the team — the oldest member is nearly 20 years my senior. Given my experience, encountering resistance while leading the team is perfectly normal. Fortunately, this pushed me to start practicing <em>influence without authority</em> early on. I believe authority can be a useful tool for a leader, but relying solely on authority to lead people is a recipe for total failure. The moment you leave that position, you’ll quickly learn what it feels like when people treat you differently.</p>

<p>A good leader should consciously avoid leveraging the privileges of their position. Instead, understand people’s needs, find common interests between individuals and the team, and make people <em>want</em> to embrace your ideas and create greater value together. Beyond that, lead by example — teach through competence, inspire through conduct. I believe a leader can be lenient with others but must be strict with themselves. That’s why I almost never skip morning standup. When the team has an important task and teammates are there, I won’t leave either — even if I’m just sitting there, I’ll stay.</p>

<p>Finally, being a leader is just one role at work. In life, you absolutely cannot expect to lead everyone’s thoughts and emotions at all times. If you can only find your sense of purpose at work, you’ll end up losing more than you gain in life. Every person is a vivid, independent individual. Through working together, I’ve come to see the real sides of everyone — their sensitivity, vulnerability, strength, resilience, selfishness, selflessness… and they can see mine too. Truly appreciating the uniqueness of every person around you, and making people <em>want</em> to work with you because of <em>who you are</em> — that’s what matters most.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="leadership" /><category term="career" /><category term="tech lead" /><summary type="html"><![CDATA[From a technically-oriented newbie leader — takeaways on staying sharp, navigating conflict, and leading with character.]]></summary></entry><entry><title type="html">Conformer: Combining CNNs and Transformers for Speech Recognition</title><link href="https://zhs2326.github.io/posts/2026/03/blog-post-1/" rel="alternate" type="text/html" title="Conformer: Combining CNNs and Transformers for Speech Recognition" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2026/03/Conformer</id><content type="html" xml:base="https://zhs2326.github.io/posts/2026/03/blog-post-1/"><![CDATA[<h1 id="conformer">Conformer</h1>

<p>Conformer is a model architecture widely used in automatic speech recognition (ASR). It combines the strengths of CNNs and Transformers by deeply integrating the two structures. To understand it well, we should not only remember its components, but also understand why it is designed that way.</p>

<p><br /></p>

<h1 id="conformer-block-overview">Conformer Block Overview</h1>

<p><img src="/images/posts/conformer/image.png" alt="Conformer Block Overview" /></p>

<p>The core of the Conformer architecture is the conformer block, which essentially has five components: Feed Forward Module (FFN) → Multi-Head Self Attention Module (MHSA) → Convolution Module → Feed Forward Module → Layernorm. This chain of operations can be formulated as:</p>

\[\begin{aligned}
x_{2} &amp;= x_{1} + 0.5*FFN_{1}(x_{1})\\
x_{3} &amp;= x_{2} + MHSA(x_{2})\\
x_{4} &amp;= x_{3} + Conv(x_{3})\\
x_{5} &amp;= x_{4} + 0.5*FFN_{2}(x_{4})\\
x_{6} &amp;= Layernorm(x_{5})
\end{aligned}\]

<p>In one sentence, at first glance: the Conformer block sandwiches the MHSA module (from the Transformer) and the Convolution module, applied in sequence, between two half-weighted FFNs.</p>
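
<p>To make this concrete, here is a minimal PyTorch-style sketch of the block’s forward pass. It is my own illustration rather than code from the paper; the sub-module names are placeholders for the components described below.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, ffn1, mhsa, conv, ffn2, d_model):
        super().__init__()
        # the four sub-modules described above, plus the final LayerNorm
        self.ffn1, self.mhsa, self.conv, self.ffn2 = ffn1, mhsa, conv, ffn2
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)   # half-weighted feed-forward
        x = x + self.mhsa(x)         # multi-head self-attention
        x = x + self.conv(x)         # convolution module
        x = x + 0.5 * self.ffn2(x)   # second half-weighted feed-forward
        return self.norm(x)          # final LayerNorm
</code></pre></div></div>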

<p>In the following, I dive into the components whose design choices I think deserve closer attention, and skip the modules that are largely identical to their vanilla versions.</p>

<p><br /></p>

<h1 id="ffn">FFN</h1>

<p><img src="/images/posts/conformer/image 1.png" alt="FFN Module" /></p>

<p>The precise sequence of the FFN is: <strong>LayerNorm → Linear(d_model, 4·d_model) → Swish → Dropout → Linear(4·d_model, d_model) → Dropout</strong>, with a residual connection around the entire module.</p>

<p>The noteworthy part is that it adopts an inverted-bottleneck structure, which gives the nonlinear activation a <strong>larger representational space</strong> to operate in.</p>
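
<p>A minimal sketch of this module in PyTorch follows, assuming only <code class="language-plaintext highlighter-rouge">d_model</code> and a dropout rate as hyperparameters; the 0.5-scaled residual connection is added by the caller, as in the block sketch above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn as nn

class FeedForwardModule(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),  # inverted bottleneck: expand 4x
            nn.SiLU(),                        # Swish activation
            nn.Dropout(dropout),
            nn.Linear(4 * d_model, d_model),  # project back to d_model
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)  # caller adds the 0.5-scaled residual
</code></pre></div></div>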

<p><br /></p>

<h1 id="convolution-module">Convolution Module</h1>

<p><img src="/images/posts/conformer/image 2.png" alt="Convolution Module" /></p>

<p>The precise sequence of the Convolution Module is: <strong>LayerNorm → Pointwise Conv(d_model, 2·d_model) → GLU → Depthwise Conv(d_model, d_model, kernel=31) → BatchNorm → Swish → Pointwise Conv(d_model, d_model) → Dropout</strong>, with a residual connection around the entire module.</p>

<p>The convolution module employs a depthwise separable convolution, preceded by a pointwise convolution that expands the channel dimension to 2× d_model for GLU activation. GLU splits the expanded tensor into two halves and computes <code class="language-plaintext highlighter-rouge">σ(gate)⊙value</code>, providing learnable channel-wise gating.</p>
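
<p>Below is a hedged sketch of the convolution module in PyTorch. The kernel size of 31 and the 2× channel expansion follow the description above; the transposes are there because <code class="language-plaintext highlighter-rouge">nn.Conv1d</code> expects the channel dimension before time, and the residual is again added by the caller.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn as nn

class ConvolutionModule(nn.Module):
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)  # splits channels in half: sigmoid(gate) * value
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=(kernel_size - 1) // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                 # x: (B, T, d_model)
        x = self.norm(x).transpose(1, 2)  # (B, d_model, T) for Conv1d
        x = self.glu(self.pointwise1(x))  # expand to 2*d_model, gate back to d_model
        x = self.swish(self.bn(self.depthwise(x)))
        x = self.dropout(self.pointwise2(x))
        return x.transpose(1, 2)          # back to (B, T, d_model)
</code></pre></div></div>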

<p><br /></p>

<h1 id="inference-flow">Inference Flow</h1>

<p>Understanding how the data flows through each operation in a model is important for building a concrete understanding, rather than an intuitive but blurry one. In the following, I’ll walk you through how the data shape changes when Conformer is used in real ASR applications.</p>

<hr />

<h3 id="1-raw-audio-input">1. Raw Audio Input</h3>

<p>Everything starts with a raw waveform. In a typical ASR pipeline, you receive audio sampled at <strong>16,000 Hz</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input waveform: (B, T_samples)
  e.g. (4, 64000)  →  4 utterances, each 4 seconds long
</code></pre></div></div>

<hr />

<h3 id="2-feature-extraction--log-mel-spectrogram">2. Feature Extraction — Log-Mel Spectrogram</h3>

<p>The raw waveform is converted to a log-mel spectrogram using a Short-Time Fourier Transform (STFT) with, for example, a 25ms window and 10ms hop.</p>

<ul>
  <li><strong>Frames</strong> = <code class="language-plaintext highlighter-rouge">T_samples / hop_length</code> ≈ <code class="language-plaintext highlighter-rouge">64000 / 160</code> = 400 frames</li>
  <li><strong>Mel bins</strong> = 80 (standard in ESPnet / WeNet setups)</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After feature extraction: (B, T, F)
  e.g. (4, 400, 80)
</code></pre></div></div>
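
<p>As a rough sketch of this step with torchaudio (the exact window, hop, and mel settings vary across toolkits; the values below simply match the numbers used in this walkthrough):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # 25 ms window at 16 kHz
    win_length=400,
    hop_length=160,   # 10 ms hop
    n_mels=80,
)

waveform = torch.randn(4, 64000)            # stand-in for 4 x 4 s of real audio
features = torch.log(mel(waveform) + 1e-6)  # (4, 80, ~400): (B, n_mels, frames)
features = features.transpose(1, 2)         # (4, ~400, 80): (B, T, F)
</code></pre></div></div>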

<hr />

<h3 id="3-specaugment-training-only">3. SpecAugment (Training Only)</h3>

<p>Time and frequency masks are applied to the feature tensor. The <strong>shape does not change</strong> — values are just zeroed out in certain bands.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After SpecAugment: (4, 400, 80)   ← same shape
</code></pre></div></div>
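
<p>A simplified sketch of the masking, with a single frequency mask and a single time mask (real SpecAugment recipes typically apply several masks with randomized widths):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def spec_augment(features, freq_mask=15, time_mask=40):
    """features: (B, T, F); zero out one random frequency band and one time band."""
    B, T, F = features.shape
    out = features.clone()
    f0 = torch.randint(0, F - freq_mask, (1,)).item()
    t0 = torch.randint(0, T - time_mask, (1,)).item()
    out[:, :, f0:f0 + freq_mask] = 0.0   # frequency mask
    out[:, t0:t0 + time_mask, :] = 0.0   # time mask
    return out   # shape unchanged: (B, T, F)
</code></pre></div></div>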

<hr />

<h3 id="4-subsampling-conv2d-subsampler">4. Subsampling (Conv2D Subsampler)</h3>

<p>To reduce the sequence length (which is expensive for attention), a Conv2D subsampling module is applied — typically with stride 2 twice, giving a <strong>4× reduction</strong>.</p>

<p>The feature map is first treated as a 2D image <code class="language-plaintext highlighter-rouge">(T, F)</code>, convolved, then reshaped into a 1D sequence projected to the model dimension <code class="language-plaintext highlighter-rouge">d_model</code> (e.g. 256).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Before subsampling: (4, 400, 80)
Add channel dimension: (4, 1, 400, 80)
Conv2d subsampling: (4, 256, 100, 20)   ← more channels, fewer frequency bins and time frames
Reshape: (4, 256, 100, 20) → (4, 100, 256×20) → (4, 100, 5120)
Linear projection: (4, 100, 256)   ← T//4, d_model
</code></pre></div></div>

<p>This is why ASR Conformers are tractable — attention runs over 100 frames, not 400.</p>
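
<p>A sketch of a subsampler that produces these shapes: two stride-2 convolutions followed by a flatten and a linear projection. The exact channel counts and padding conventions differ across toolkits such as ESPnet and WeNet, so treat this as an illustration of the shape bookkeeping rather than a reference implementation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class Conv2dSubsampling(nn.Module):
    def __init__(self, in_feats=80, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # after two stride-2 convs the 80 mel bins become roughly 80 // 4 = 20
        self.proj = nn.Linear(d_model * (in_feats // 4), d_model)

    def forward(self, x):                           # x: (B, T, F), e.g. (4, 400, 80)
        x = x.unsqueeze(1)                          # (B, 1, T, F)
        x = self.conv(x)                            # (B, 256, T//4, F//4)
        B, C, T, F = x.shape
        x = x.transpose(1, 2).reshape(B, T, C * F)  # (B, T//4, 256*20)
        return self.proj(x)                         # (B, T//4, d_model)
</code></pre></div></div>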

<hr />

<h3 id="5-positional-encoding">5. Positional Encoding</h3>

<p>A sinusoidal (or relative) positional encoding of shape <code class="language-plaintext highlighter-rouge">(1, T', d_model)</code> is <strong>added</strong> to the sequence. Shape is unchanged.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After positional encoding: (4, 100, 256)
</code></pre></div></div>

<hr />

<h3 id="6-conformer-block-n">6. Conformer Block (×N)</h3>

<p>Each Conformer block is composed of four sub-modules in sequence. Let’s trace shape through <strong>one block</strong>:</p>

<h3 id="6a-feed-forward-module-first-half-scale-">6a. Feed-Forward Module (first half, scale ½)</h3>

<p>A two-layer FFN with expansion factor 4:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input          : (4, 100, 256)
After Linear_1 : (4, 100, 1024)   ← expand
After Swish    : (4, 100, 1024)
After Dropout  : (4, 100, 1024)
After Linear_2 : (4, 100, 256)    ← project back
</code></pre></div></div>

<h3 id="6b-multi-head-self-attention-module">6b. Multi-Head Self-Attention Module</h3>

<p>With <code class="language-plaintext highlighter-rouge">num_heads = 4</code> and <code class="language-plaintext highlighter-rouge">d_model = 256</code>, each head has <code class="language-plaintext highlighter-rouge">d_k = 64</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input          : (4, 100, 256)
Q, K, V        : each (4, 4, 100, 64)   ← (B, heads, T, d_k)
Attention scores: (4, 4, 100, 100)
After softmax  : (4, 4, 100, 100)
Context        : (4, 4, 100, 64)
After reshape  : (4, 100, 256)
After out proj : (4, 100, 256)
</code></pre></div></div>

<h3 id="6c-convolution-module">6c. Convolution Module</h3>

<p>A depthwise convolution with kernel size 31 operates along the time axis:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input               : (4, 100, 256)
After pointwise_1   : (4, 100, 512)   ← pointwise conv doubles channels for GLU
After GLU           : (4, 100, 256)   ← halves back
After depthwise conv: (4, 100, 256)   ← kernel=31, same padding
After BatchNorm     : (4, 100, 256)
After Swish         : (4, 100, 256)
After pointwise_2   : (4, 100, 256)
</code></pre></div></div>

<h3 id="6d-feed-forward-module-second-half-scale-">6d. Feed-Forward Module (second half, scale ½)</h3>

<p>Same as 6a. Output shape stays <code class="language-plaintext highlighter-rouge">(4, 100, 256)</code>.</p>

<p>After all N=12 Conformer blocks (typical for medium-size models), the shape is still:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After N Conformer blocks: (4, 100, 256)
</code></pre></div></div>

<hr />

<h3 id="7-ctc--attention-decoder-head">7. CTC / Attention Decoder Head</h3>

<p>Depending on the decoding strategy:</p>

<p><strong>CTC Head</strong> — a linear projection over the vocabulary (e.g. 5000 BPE tokens):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After Linear  : (4, 100, 5000)
After LogSoftmax: (4, 100, 5000)   ← per-frame token log-probs
</code></pre></div></div>
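
<p>As a sketch of the CTC head and its loss call (note that <code class="language-plaintext highlighter-rouge">nn.CTCLoss</code> expects log-probabilities with time first, hence the transpose; the vocabulary size of 5000 and target length of 20 are just running examples):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

vocab_size, d_model = 5000, 256
ctc_head = nn.Linear(d_model, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

encoder_out = torch.randn(4, 100, d_model)           # (B, T', d_model)
log_probs = ctc_head(encoder_out).log_softmax(-1)    # (4, 100, 5000)

targets = torch.randint(1, vocab_size, (4, 20))      # dummy token ids (0 is blank)
input_lengths = torch.full((4,), 100)
target_lengths = torch.full((4,), 20)
loss = ctc_loss(log_probs.transpose(0, 1),           # (T', B, vocab) as CTCLoss expects
                targets, input_lengths, target_lengths)
</code></pre></div></div>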

<p><strong>Attention Decoder</strong> — an autoregressive Transformer decoder cross-attending to the encoder output, producing one token at a time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Encoder output : (4, 100, 256)
Decoder input  : (4, L_text, 256)   ← L_text = target length
Cross-attention: keys/values from encoder, queries from decoder
Final output   : (4, L_text, 5000)
</code></pre></div></div>

<hr />

<h3 id="summary-table">Summary Table</h3>

<table>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Shape</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Raw waveform</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T_samples)</code></td>
    </tr>
    <tr>
      <td>Log-Mel features</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T, 80)</code></td>
    </tr>
    <tr>
      <td>After subsampling</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T/4, 256)</code></td>
    </tr>
    <tr>
      <td>After each Conformer block</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T/4, 256)</code></td>
    </tr>
    <tr>
      <td>CTC output</td>
      <td><code class="language-plaintext highlighter-rouge">(B, T/4, vocab_size)</code></td>
    </tr>
    <tr>
      <td>Decoder output</td>
      <td><code class="language-plaintext highlighter-rouge">(B, L_text, vocab_size)</code></td>
    </tr>
  </tbody>
</table>

<hr />

<p>The key insight is that the <strong>sequence length shrinks early</strong> (at the subsampler) and then <strong>stays constant</strong> all the way through the Conformer stack — this is what makes the self-attention computationally feasible. The model dimension <code class="language-plaintext highlighter-rouge">d_model</code> is similarly fixed throughout, acting as a consistent “information highway” between modules.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="speech recognition" /><category term="deep learning" /><category term="transformer" /><summary type="html"><![CDATA[A deep dive into the Conformer architecture, which fuses CNNs and Transformers for state-of-the-art automatic speech recognition.]]></summary></entry><entry><title type="html">I Tested AI Detectors on My Daily Feeds — Here’s What I Found</title><link href="https://zhs2326.github.io/posts/2025/11/blog-post-1/" rel="alternate" type="text/html" title="I Tested AI Detectors on My Daily Feeds — Here’s What I Found" /><published>2025-11-09T00:00:00+00:00</published><updated>2025-11-09T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2025/11/I%20Tested%20AI%20Detectors%20on%20My%20Daily%20Feeds%20%E2%80%94%20Here%E2%80%99s%20What%20I%20Found</id><content type="html" xml:base="https://zhs2326.github.io/posts/2025/11/blog-post-1/"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>With AI-generated content becoming increasingly common these days (even this blog post has been processed by AI), I’ve started wondering how much of what I read every day is actually produced by AI — and whether I still have the ability or tools to tell AI-generated content apart from human-written ones.</p>

<p>To explore this, I ran a small experiment. I chose three mainstream content platforms: <strong>X</strong>, <strong>LinkedIn</strong>, and <strong>Medium</strong>. Over the course of five days, I randomly selected three posts I encountered on each platform each day, then rated how likely I thought each piece was AI-generated.</p>

<p>In addition, I used five popular AI-text detectors — <strong>GPTZero</strong>, <strong>Originality.AI</strong>, <strong>Quillbot</strong>, <strong>GPT2 Output Detector</strong>, and <strong>GLTR</strong> — to evaluate the same content. While I didn’t have the ground truth (since I didn’t ask the authors whether they used AI), it was still fascinating to see how much of my daily input might be influenced by AI. It also made me wonder whether different platforms have different proportions of AI-generated content — and if so, whether we should start seeking out “high human-to-AI ratio” platforms when authenticity really matters.</p>

<p>Before diving into the results, I’ll first give a brief overview of how the public AI-text detectors that I use (GPT2 Output Detector and GLTR) actually work and how I gave my score.</p>

<h1 id="gpt2-output-detector">GPT2 Output Detector</h1>

<p>The GPT-2 Output Detector is a RoBERTa-based model designed for sequence classification. It is trained to distinguish between real human-written text and text generated by GPT-2. The model outputs a probability indicating the likelihood that a given input sequence was produced by a human rather than by the GPT-2 language model.</p>

<p>For further details, you can refer to the following resources: <a href="https://openai.com/index/gpt-2-1-5b-release/">OpenAI’s GPT-2 1.5B release</a>, the <a href="https://arxiv.org/pdf/1908.09203">research paper</a>, and the <a href="https://github.com/openai/gpt-2-output-dataset/tree/master/detector">GPT-2 Output Dataset repository</a>.</p>

<h1 id="gltr">GLTR</h1>

<p>GLTR (Giant Language Model Test Room) leverages language models, like GPT-2, to detect AI-generated text. It is based on the idea that language generation models tend to select the most statistically likely words, while human-written text often exhibits more variation. If a paragraph contains words that are consistently high-probability choices for a language model, it is more likely to be AI-generated. GLTR visualizes the probability of each word being part of a GPT-2 output, helping to distinguish between human-written and AI-generated text.</p>

<p>For more information on GLTR, visit <a href="http://gltr.io/">GLTR.io</a> or check out the <a href="https://arxiv.org/pdf/1906.04043">research paper</a>.</p>

<p>In the following analysis, I used the <strong>ratio of “green” words to total words</strong> in GLTR as its estimate of AI-generation probability. For the other AI detectors, I simply used the scores or probabilities they provided to represent how likely the text was generated by AI.</p>
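
<p>To make the “green ratio” concrete, here is a rough sketch of how such a score can be computed with GPT-2 via Hugging Face <code class="language-plaintext highlighter-rouge">transformers</code>. This is my own approximation of GLTR’s top-10 “green” bucket, not GLTR’s actual code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def green_ratio(text, top_k=10):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                 # (1, seq_len, vocab)
    next_ids = ids[0, 1:]                          # the tokens that were actually written
    preds = logits[0, :-1]                         # model predictions for those positions
    # rank of each actual token under the model's next-token distribution
    sorted_ids = preds.argsort(dim=-1, descending=True)
    ranks = (sorted_ids == next_ids.unsqueeze(-1)).float().argmax(-1)
    return ranks.lt(top_k).float().mean().item()   # fraction of tokens in the top-10
</code></pre></div></div>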

<h1 id="human-evaluation">Human Evaluation</h1>

<p>All the human scores in this analysis were rated by me. For scoring, I adopted the <strong>Mean Opinion Score (MOS)</strong> approach: if I believed a piece was definitely written by AI, I gave it a <strong>5</strong>; if I thought it was definitely human-written, I gave it a <strong>1</strong>. These scores were later normalized to a 0–1 range for analysis.</p>

<p>(One thing worth noting is that I’m not a native English speaker, which might slightly influence my judgment when evaluating whether something “feels” AI-generated.)</p>

<p>Now, let’s move on to the results. First, we’ll look at the <strong>percentage of AI-generated content</strong> on each platform — as evaluated by each detector and by myself.</p>

<h1 id="experimental-results">Experimental Results</h1>

<p><img src="/images/posts/I-Tested-AI-Detectors-on-My-Daily-Feeds-Heres-What-I-Found/image.png" alt="image.png" /></p>

<p><img src="/images/posts/I-Tested-AI-Detectors-on-My-Daily-Feeds-Heres-What-I-Found/image 1.png" alt="image.png" /></p>

<p><img src="/images/posts/I-Tested-AI-Detectors-on-My-Daily-Feeds-Heres-What-I-Found/image 2.png" alt="image.png" /></p>

<p>We can see that <strong>Quillbot</strong> and the <strong>GPT-2 Output Detector</strong> (especially the latter) show almost no ability to distinguish between AI-generated and human-written content. Their outputs are highly concentrated, consistently producing AI probabilities close to zero. This isn’t too surprising for the GPT-2 Output Detector — after all, as of <strong>November 2025</strong>, very few AI-generated texts are actually produced by GPT-2. However, it’s somewhat surprising that <strong>Quillbot</strong>, a commercial product, performs in a similar way.</p>

<p>On the other hand, <strong>GPTZero</strong> displays a noticeably wider score distribution. While that doesn’t necessarily mean it’s more accurate (since I don’t have the ground truth), it does suggest that GPTZero attempts a more fine-grained evaluation of how “AI-like” a text might be compared to the other detectors.</p>

<p>Next, let’s look at some statistics and put the results of all three platforms together.</p>

<table>
  <thead>
    <tr>
      <th>Detector</th>
      <th>Twitter</th>
      <th>LinkedIn</th>
      <th>Medium</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPTZero</td>
      <td>0.218667 / 0.403111 / 0.380635</td>
      <td>0.3 / 0.425357 / 0.511976</td>
      <td>0.533333 / 0.41579 / 0.904627</td>
    </tr>
    <tr>
      <td>Originality.ai</td>
      <td>0.066667 / 0.255445 / 0.149693</td>
      <td>0.192 / 0.391174 / -0.461454</td>
      <td>0.294667 / 0.432119 / 0.697721</td>
    </tr>
    <tr>
      <td>Quillbot</td>
      <td>0.0 / 0.0 / 0.0</td>
      <td>0.088 / 0.256549 / 0.299827</td>
      <td>0.064 / 0.186616 / 0.206284</td>
    </tr>
    <tr>
      <td>GPT2Output</td>
      <td>0.000233 / 6.2e-05 / 0.968864</td>
      <td>0.03834 / 0.147356 / 0.250164</td>
      <td>0.00318 / 0.011459 / 0.203963</td>
    </tr>
    <tr>
      <td>GLTR</td>
      <td>0.674333 / 0.066996 / -0.229616</td>
      <td>0.641067 / 0.066023 / -0.437913</td>
      <td>0.6584 / 0.033645 / 0.341786</td>
    </tr>
    <tr>
      <td>Human</td>
      <td>0.48 / 0.248424 / 1.0</td>
      <td>0.52 / 0.211119 / 1.0</td>
      <td>0.413333 / 0.24456 / 1.0</td>
    </tr>
  </tbody>
</table>

<p>Each cell in the table above shows the <strong>mean</strong>, <strong>standard deviation</strong>, and <strong>correlation with human scores</strong> (my ratings) for each detector.</p>

<p>From these results, we can see that <strong>GPTZero</strong> shows a noticeably strong correlation with my human ratings. Interestingly, even though the <strong>GPT-2 Output Detector</strong> almost always reports very low AI probabilities, its results still correlate fairly well with my scores.</p>

<p>When comparing across platforms, the overall proportion of AI-generated content appears quite similar among the three. This suggests that AI-assisted writing has become pervasive across mainstream platforms — and perhaps serves as a reminder to question whether what we’re reading truly reflects someone’s original thought, even when it’s posted under their name.</p>

<h1 id="conclusion">Conclusion</h1>

<p>This post is just a small experiment exploring how much of the content I encounter daily might be AI-generated — and how different AI detectors perform on those samples. Personally, I think <strong>GPTZero</strong> outperforms the others, even though its underlying principles aren’t publicly known.</p>

<p>That said, as generative AI continues to evolve, it will inevitably become harder for humans to distinguish AI-generated content from human-written text. And as AI detectors improve, AI models can, in turn, learn from them and evolve even further.</p>

<p>But is it necessarily a bad thing if one day most of what we read is created by AI — and we can no longer tell the difference? Maybe not. If AI eventually reaches or surpasses human-level intelligence, its outputs might actually become more insightful and valuable than many human-written pieces. At that point, AI-generated content could even become something we learn from, rather than something we try to detect and then ignore.</p>

<p>I find that idea a little scary — but also incredibly exciting.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="ai detector" /></entry><entry><title type="html">Paper Reading: Progressive Neural Networks</title><link href="https://zhs2326.github.io/posts/2025/07/blog-post-2/" rel="alternate" type="text/html" title="Paper Reading: Progressive Neural Networks" /><published>2025-07-07T00:00:00+00:00</published><updated>2025-07-07T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2025/07/Progressive%20Neural%20Networks</id><content type="html" xml:base="https://zhs2326.github.io/posts/2025/07/blog-post-2/"><![CDATA[<h1 id="progressive-neural-networks">Progressive Neural Networks</h1>

<p><strong>Citation Count:</strong> 3443<br />
<strong>Organization:</strong> DeepMind<br />
<strong>Year:</strong> 2016</p>

<h2 id="background">Background</h2>

<p>The paper <em>Progressive Neural Networks</em> is one of the earliest and most influential works in the field of continual learning. Although its core idea is simpler compared to other foundational papers—such as <em>Overcoming Catastrophic Forgetting in Neural Networks</em> (EWC)—it represents a different category of approaches. Instead of using regularization to prevent forgetting, it tackles the problem by <strong>expanding the network</strong> and <strong>adding new parameters</strong> when learning new tasks. Let’s take a closer look at how it works!</p>

<h2 id="method">Method</h2>

<p>If you’re familiar with regularization-based methods, you’ll recall that their key idea is to constrain the model—typically through some form of penalty—so it doesn’t forget previously learned tasks when adapting to new ones. A hidden assumption in that approach is the desire to <strong>keep model complexity fixed</strong>.</p>

<p>But what if model size isn’t a hard constraint? If we allow the network to grow, we can simply <strong>add new capacity</strong> for new tasks and preserve the existing knowledge as-is. This is the central idea behind <em>Progressive Neural Networks</em>.</p>

<p>Each <strong>column</strong> in the architecture below represents a sub-network dedicated to a specific task. Initially, only the first column exists, handling the first task. Whenever a new task arrives, a new column is added to learn that task.</p>

<p><img src="/images/posts/progressive-neural-networks/image.png" alt="Architecture diagram of Progressive Neural Networks" /></p>

<p>Here are a few key points worth noting:</p>

<ol>
  <li><strong>Knowledge reuse:</strong> Unlike training a completely separate model for each new task, <em>Progressive Neural Networks</em> allow for <strong>knowledge transfer</strong> via <strong>lateral connections</strong>. These connections link each layer in the new column to the previous layers in trained columns, enabling the model to reuse useful representations.</li>
  <li><strong>No forgetting:</strong> When learning new tasks, the parameters of previously trained columns are <strong>frozen</strong>. This is the fundamental reason why catastrophic forgetting does not occur.</li>
</ol>
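
<p>To make the column-and-lateral-connection idea concrete, here is a heavily simplified two-column sketch in PyTorch. Real progressive networks use multi-layer columns and adapter modules with non-linearities on the lateral connections; this toy version only shows where the freezing and the lateral path sit.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class TwoColumnProgressiveNet(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        # column 1: trained on task 1, then frozen
        self.c1_l1 = nn.Linear(d_in, d_hidden)
        self.c1_l2 = nn.Linear(d_hidden, d_out)
        # column 2: added for task 2
        self.c2_l1 = nn.Linear(d_in, d_hidden)
        self.c2_l2 = nn.Linear(d_hidden, d_out)
        # lateral connection from column 1's hidden layer into column 2's output layer
        self.lateral = nn.Linear(d_hidden, d_out)

    def freeze_column1(self):
        for p in [*self.c1_l1.parameters(), *self.c1_l2.parameters()]:
            p.requires_grad = False   # no forgetting: the old column never changes

    def forward_task2(self, x):
        h1 = torch.relu(self.c1_l1(x))             # frozen features from column 1
        h2 = torch.relu(self.c2_l1(x))             # new features from column 2
        return self.c2_l2(h2) + self.lateral(h1)   # reuse old knowledge laterally
</code></pre></div></div>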

<h2 id="conclusion">Conclusion</h2>

<p><em>Progressive Neural Networks</em> is a relatively easy-to-understand paper with clear intuition and straightforward implementation. Still, it’s important to appreciate why it belongs in the continual learning family and how its design philosophy differs from regularization-based approaches. While it trades off parameter efficiency for simplicity and robustness to forgetting, it remains a foundational method in the field—and an important one to understand.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="continual learning" /></entry><entry><title type="html">Paper Reading: Overcoming Catastrophic Forgetting in Neural Network</title><link href="https://zhs2326.github.io/posts/2025/06/blog-post-2/" rel="alternate" type="text/html" title="Paper Reading: Overcoming Catastrophic Forgetting in Neural Network" /><published>2025-06-29T00:00:00+00:00</published><updated>2025-06-29T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2025/06/Overcoming%20Catastrophic%20Forgetting%20in%20Neural%20Network</id><content type="html" xml:base="https://zhs2326.github.io/posts/2025/06/blog-post-2/"><![CDATA[<h2 id="background">Background</h2>

<p>One way to address catastrophic forgetting is to introduce regularization when learning new tasks. The idea is straightforward: continue learning the new task while applying regularization to prevent the model from forgetting what it learned from previous tasks. The tricky part is figuring out <em>how</em> to apply this regularization effectively. The paper <em>“Overcoming Catastrophic Forgetting in Neural Networks”</em> proposes a method called <strong>Elastic Weight Consolidation (EWC)</strong>, which offers one such solution.</p>

<h2 id="fisher-information">Fisher Information</h2>

<p>From my point of view, the core idea of <strong>Elastic Weight Consolidation (EWC)</strong> is grounded in <strong>Fisher Information</strong>. If you develop a solid understanding of Fisher Information, you’ve essentially grasped 90% of the paper. However, the concept can be quite abstract, especially for those without a strong background in statistics. Personally, despite having a basic understanding of probability, it took me over 10 hours to truly grasp the foundational idea of Fisher Information.</p>

<p>In the following, I’ll walk you through Fisher Information from a beginner’s point of view. Some terminology may be simplified and not strictly rigorous, but the explanation should be sufficient to help you understand the core intuition behind the paper.</p>

<h3 id="problem-to-solve">Problem to Solve</h3>

<p>Let’s begin with the problem context. Imagine we have a set of data points that are assumed to come from a known family of probability distributions. Our goal is to estimate some unknown parameters of this distribution using the observed data.</p>

<p>For example, suppose the data points are drawn from a Gaussian distribution, but we don’t know the mean or variance. How precise can our estimate of the parameters be, given the data? <strong>Fisher Information</strong> provides a way to quantify that.</p>

<p>But before we define Fisher Information, let’s revisit a fundamental concept: <strong>likelihood</strong>.</p>

<h3 id="likelihood">Likelihood</h3>

<p>Most of us are familiar with <strong>probability</strong>. Given a distribution parameterized by some variable $\theta$, the probability tells us how likely a specific data point $x$ is under that distribution. If the data point is continuous, we can represent its probability using a probability density function, denoted as $p_{\theta}(x)$. In this case, $x$ is the variable, and $\theta$ is fixed.</p>

<p>In contrast, <strong>likelihood</strong> flips the perspective. Given an observed data point $x$, likelihood asks: <em>How does the probability of observing this data change as the parameter</em> $\theta$ <em>varies?</em> There are two things noteworthy here:</p>

<ul>
  <li>Likelihood is a <strong>function of the parameter</strong> $\theta$, with the data $x$ fixed.</li>
  <li>It is usually <strong>proportional</strong> to the probability density evaluated at $x$ for a given $\theta$.</li>
</ul>

<p>Mathematically, we write the likelihood as: $L(\theta\vert x) = p_{\theta}(x)$. Even though it looks the same as the PDF, the interpretation is different: here, we treat $x$ as fixed and view $L$ as a function of $\theta$. With a basic understanding of likelihood in place, let’s now return to the concept of <strong>Fisher Information</strong>.</p>

<h3 id="now-to-fisher-information">Now to Fisher Information</h3>

<p>Mathematically, <strong>Fisher Information</strong> is denoted as:</p>

\[I_{X}(\theta) = \int_x p(x|\theta)\left(\frac{d}{d\theta}\log p(x|\theta)\right)^2 dx\]

<p>This formula can be broken down into two main parts:</p>

<ol>
  <li>The integrand: $\left(\frac{d}{d\theta}\log p(x \vert \theta)\right)^2$, which is the square of the derivative of the log-likelihood with respect to the parameter $\theta$.</li>
  <li>The integration: Taking the expectation over all possible values of $x$, weighted by their likelihood under the current parameter $\theta$.</li>
</ol>

<p>The integration is straightforward, while the integrand is more involved. Let’s unpack the intuition behind the integrand step by step.</p>

<ul>
  <li>First, note that the <strong>log</strong> is used for mathematical convenience—it turns products into sums and simplifies derivatives. Since it’s a monotonic function, it doesn’t fundamentally change where the likelihood peaks or how sensitive it is to $\theta$. So for intuition, we can temporarily set aside the log and focus on the likelihood itself.</li>
  <li>Next, the <strong>derivative</strong> of the likelihood (or log-likelihood) with respect to $\theta$ tells us <strong>how sensitive</strong> the probability of observing data point $x$ is to changes in the parameter. In other words, it measures how much the likelihood shifts when $\theta$ is nudged.</li>
  <li>Then, the <strong>square</strong> of the derivative simply removes the sign and emphasizes the magnitude—ensuring that both increases and decreases in likelihood contribute positively to the overall measure.</li>
</ul>

<p>Putting this together:</p>

<p>If the derivative is close to <strong>zero</strong>, it means that small changes in $\theta$ don’t affect the likelihood much. This suggests that $x$ doesn’t contain much information about $\theta$ at that value, because nearby values of $\theta$ are almost equally likely to have generated the same $x$, and thus we can’t estimate $\theta$ very precisely based on $x$—the <strong>Fisher Information is low</strong>. Conversely, if the derivative is <strong>large</strong>, the likelihood is very sensitive to $\theta$, meaning $x$ can pin down $\theta$ well—the <strong>Fisher Information is high</strong>.</p>

<p>Now, I think you’ve got a basic understanding of why Fisher Information captures how sensitive the likelihood of the data $x$ is to the parameter $\theta$.</p>

<h2 id="back-to-ewc">Back to EWC</h2>

<p><img src="/images/posts/overcoming-catastrophic-forgetting-in-neural-networks/image.png" alt="image.png" /></p>

<p>Let’s return to EWC. At its core, <strong>Elastic Weight Consolidation (EWC)</strong> is based on the above loss function, which consists of two components. The first is the standard loss for task B (the new task), and the second is a regularization term. In this term, $F$ denotes the <strong>Fisher Information Matrix</strong>, $\theta$ represents the current weights being optimized, and $\theta_A^*$ refers to the weights previously learned for task A (the old task). $\lambda$ is a hyperparameter that controls the strength of the regularization, and $i$ indexes the model parameters.</p>

<p>In other words, weights associated with higher Fisher Information—meaning small changes in them would significantly affect the likelihood—are considered more critical for preserving performance on previous tasks. EWC penalizes changes to these important weights more heavily, thereby preventing the model from “forgetting” them when learning new tasks. This is how Fisher Information is used to guide the regularization in EWC and mitigate catastrophic forgetting.</p>
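
<p>As a rough sketch (my own simplification, not DeepMind’s implementation), the diagonal Fisher estimate and the EWC penalty can be written as follows; the Fisher diagonal is approximated by averaging squared gradients over task A’s data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def diag_fisher(model, data_loader, loss_fn):
    """Approximate the diagonal of the Fisher matrix with squared gradients on task A."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    return {n: f / len(data_loader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_A_i)^2, as in the EWC loss."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty
</code></pre></div></div>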

<p>The paper also provides a theoretical justification for the form of the loss function, even though the intuition is quite clear. For a more formal understanding, you can refer to Equations (1) and (2) in the paper, which are relatively straightforward to derive.</p>

<p>In the experimental section, the authors evaluate EWC on both a classic benchmark—Permuted MNIST—and reinforcement learning tasks. The results show that EWC outperforms vanilla SGD and L2 regularization in terms of maintaining performance across tasks. We won’t go into the experimental details here.</p>

<h2 id="conclusion">Conclusion</h2>

<p><em>“Overcoming Catastrophic Forgetting in Neural Networks”</em> is a classic paper in the field of continual learning. It presents a clear narrative, combining intuitive ideas with solid mathematical foundations, and supports them with well-designed experiments that demonstrate the method’s effectiveness. While fully grasping every detail may require a strong background in statistics or some additional reading, the paper is absolutely worth revisiting multiple times to understand the reasoning behind its approach.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="continual learning" /></entry><entry><title type="html">Paper Reading: Learning without Forgetting</title><link href="https://zhs2326.github.io/posts/2025/06/blog-post-1/" rel="alternate" type="text/html" title="Paper Reading: Learning without Forgetting" /><published>2025-06-01T00:00:00+00:00</published><updated>2025-06-01T00:00:00+00:00</updated><id>https://zhs2326.github.io/posts/2025/06/learning-without-forgetting</id><content type="html" xml:base="https://zhs2326.github.io/posts/2025/06/blog-post-1/"><![CDATA[<h1 id="learning-without-forgetting">Learning without Forgetting</h1>

<p><strong>Citation Count:</strong> 5576<br />
<strong>Main Idea:</strong> Use the old model’s predictions on new data to regularize the network against forgetting old tasks when learning new tasks (under the constraint of having no access to previous data)<br />
<strong>Organization:</strong> UIUC<br />
<strong>Year:</strong> 2016</p>

<h2 id="paper-link">Paper Link</h2>

<p><a href="https://arxiv.org/pdf/1606.09282">https://arxiv.org/pdf/1606.09282</a></p>

<h2 id="background">Background</h2>

<p>There are three common approaches to learning new classification tasks while maintaining performance on previously learned ones, each with distinct advantages and disadvantages:</p>

<ol>
  <li>Feature extraction: The original network’s feature extraction and classifier components remain fixed, while a new classifier is added to the feature extractor’s output to learn new tasks.
    <ul>
      <li>Best performance on original tasks</li>
      <li>Suboptimal performance on new tasks since features are optimized for old tasks</li>
    </ul>
  </li>
  <li>Fine Tuning: Both the original network’s feature extractor and the new classifier are tuned for the new task.
    <ul>
      <li>Good performance on new tasks</li>
      <li>Degraded performance on old tasks as features become optimized for new tasks</li>
    </ul>
  </li>
  <li>Joint Training: Data from both old and new tasks are combined for training, with the feature extractor and all classifiers being tuned simultaneously.
    <ul>
      <li>Good performance on both old and new tasks</li>
      <li>Requires access to old task data</li>
      <li>Longer training time due to increased data volume and full network retraining</li>
    </ul>
  </li>
</ol>

<p>To combine the advantages and eliminate the disadvantages of these methods, the authors proposed <strong>Learning without Forgetting (LWF)</strong>, which offers:</p>

<ol>
  <li>Good performance on both new and old tasks</li>
  <li>No need for old task data</li>
  <li>Training time comparable to fine-tuning when learning new tasks</li>
</ol>

<h2 id="main-idea">Main Idea</h2>

<p>While the original paper contains many details, the core concept is straightforward:</p>

<p><strong>Joint training achieves good performance on both old and new tasks but requires old data. To overcome this limitation, we can use an alternative approach: have the original network classify new data for old tasks, recording these outputs as a representation of the original network. Then, while learning new tasks, train the network to maintain these recorded outputs (feature extractor + old classifier) while simultaneously learning the new tasks (feature extractor + new classifier).</strong></p>

<p>The core idea can be expressed mathematically as follows:</p>

<p><img src="/images/posts/learning-without-forgetting/image.png" alt="LWF Mathematical Formulation" /></p>

<p>where $s$ represents the feature extractor, $o$ relates to the old classifier, $n$ relates to the new classifier, and $R$ represents regularization terms.</p>
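
<p>As a rough sketch of the core loss in PyTorch (my own simplification): cross-entropy on the new task plus a distillation term that keeps the old-task head close to the recorded outputs. The temperature of 2 follows the paper; the KL form below differs from the paper’s distillation cross-entropy only by a constant with respect to the trainable parameters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

def lwf_loss(new_logits, new_labels, old_logits, recorded_old_logits, T=2.0, lam=1.0):
    # standard loss for the new task
    loss_new = F.cross_entropy(new_logits, new_labels)
    # keep the old-task head close to the outputs recorded before training started
    p_old = F.log_softmax(old_logits / T, dim=-1)
    q_old = F.softmax(recorded_old_logits / T, dim=-1)
    loss_old = F.kl_div(p_old, q_old, reduction="batchmean") * (T * T)
    return loss_new + lam * loss_old
</code></pre></div></div>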

<h2 id="experimental-results">Experimental Results</h2>

<p>The experimental results demonstrate that LWF achieves a good balance between old and new tasks, as illustrated below:</p>

<p><img src="/images/posts/learning-without-forgetting/image 1.png" alt="LWF Experimental Results" /></p>

<p>While most ablation studies yield minor insights, one notable finding is that task dissimilarity affects performance: when new tasks differ significantly from old ones, performance on old tasks deteriorates across almost all methods. This phenomenon has also been noted in other papers.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Learning without Forgetting stands as a seminal paper in continual learning. Though its method and performance may seem modest by today’s standards, it remains widely cited in subsequent research. Understanding this work provides crucial insights into basic approaches for preventing catastrophic forgetting. When studied alongside other key papers in continual learning, it helps illuminate the field’s evolution and underlying principles.</p>]]></content><author><name>Haoshuai Zhou</name><email>haoshuaizhou97@gmail.com</email></author><category term="continual learning" /></entry></feed>