RNA-seq Workflow: From FASTQ to Differential Expression

What is RNA-seq?

RNA sequencing (RNA-seq) is a high-throughput technique for measuring gene expression across an entire transcriptome. By converting RNA into complementary DNA (cDNA) and sequencing millions of short fragments, we can determine which genes are active — and by how much — in a given biological sample under specific conditions.

The central question RNA-seq answers: "Which genes are expressed differently between my two conditions?" That might be treated vs. untreated cells, diseased vs. healthy tissue, or any other comparison.

ℹ️

This guide uses a Linux command-line environment. Most steps assume paired-end Illumina reads and a reference genome. Single-end workflows are very similar — just drop the second file argument.

Tools you'll need

FastQC

Per-read quality metrics and adapter content checks

Trimmomatic

Adapter and low-quality base trimming

HISAT2 / STAR

Splice-aware alignment to a reference genome

SAMtools

Sort, index, and manipulate BAM files

featureCounts

Count reads overlapping annotated features

DESeq2

R package for differential expression analysis

// step 01

Understanding FASTQ Format

Every sequencing run produces FASTQ files — a text-based format encoding both the nucleotide sequence and a per-base quality score. Each read spans exactly 4 lines:

fastq

# Line 1: sequence identifier (starts with @)
@SRR12345678.1 1 length=150
# Line 2: raw nucleotide sequence
ACGTACGTACGTACGTACGT...
# Line 3: separator (always a +)
+
# Line 4: quality scores (ASCII encoded, same length as line 2)
IIIIIIIIHHHHGGGFFEEE...

Quality scores are Phred-encoded: the character I means Q40 (0.01% error rate), while ! means Q0 (100% error rate). The minimum acceptable average quality for most analyses is Q20–Q30.

💡

Paired-end sequencing produces two files per sample: sample_R1.fastq.gz (forward) and sample_R2.fastq.gz (reverse). The read order must match between files!

// step 02

Quality Control with FastQC

Before doing anything else, always run FastQC. It produces an HTML report with multiple quality modules so you can spot problems early — adapter contamination, low-quality tails, GC content anomalies, or PCR duplicates.

bash

# Run FastQC on both paired-end files
fastqc -t 4 sample_R1.fastq.gz sample_R2.fastq.gz -o ./fastqc_out/

# If you have many samples, use MultiQC to aggregate all reports
multiqc ./fastqc_out/ -o ./multiqc_out/

What to look for

The most important FastQC modules for RNA-seq are:

Module	What it tells you	Typical outcome
Per base seq quality	Quality score by position	Should be ≥Q28 across the read; drop at 3′ end is normal
Per sequence quality	Distribution of mean qualities	Spike near Q30–Q38 = healthy library
Adapter content	Illumina adapter contamination	Should be near 0%; if high → trim
Sequence duplication	PCR or optical duplicates	RNA-seq often has high duplication — usually fine unless extreme
GC content	Genome-level GC distribution	Should match your organism; bimodal = contamination

⚠️

Don't panic at red flags in FastQC — RNA-seq routinely fails "sequence duplication" and "per-sequence GC content" modules due to the nature of transcriptome sequencing. Context matters more than the color-coded pass/fail.

// step 03

Adapter Trimming with Trimmomatic

Adapter sequences must be removed before alignment. They arise when the DNA fragment is shorter than the read length, causing the sequencer to read into the adapter ligated to the 3′ end. Trim Galore (a wrapper around Cutadapt) and Trimmomatic are the most popular options.

bash — Trimmomatic PE

trimmomatic PE -threads 8 -phred33 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1_paired.fastq.gz   sample_R1_unpaired.fastq.gz \
  sample_R2_paired.fastq.gz   sample_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 \
  TRAILING:3 \
  SLIDINGWINDOW:4:15 \
  MINLEN:36

Key parameters explained: ILLUMINACLIP removes adapters with up to 2 mismatches; SLIDINGWINDOW:4:15 cuts when the average quality in a 4-base window drops below Q15; MINLEN:36 discards reads shorter than 36 bp after trimming.

💡

After trimming, re-run FastQC on the trimmed files. You should see adapter content drop to near zero and per-base quality improve at the 3′ end.

// step 04

Alignment: HISAT2 vs STAR

RNA-seq reads span exon–intron boundaries, so we need a splice-aware aligner. The two workhorses are HISAT2 (fast, lower memory) and STAR (highly accurate, higher memory). Both require a pre-built genome index.

Building a HISAT2 index

bash

# Download genome and annotation (example: GRCh38)
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

# Build HISAT2 index (takes ~30 min, ~8 GB RAM)
hisat2-build -p 8 \
  Homo_sapiens.GRCh38.dna.toplevel.fa \
  ./genome_index/GRCh38

Running HISAT2 alignment

bash

hisat2 -p 8 \
  -x ./genome_index/GRCh38 \
  -1 sample_R1_paired.fastq.gz \
  -2 sample_R2_paired.fastq.gz \
  --dta \
  -S sample.sam \
  2> sample_hisat2.log

# Convert SAM → BAM, sort, and index
samtools view -@ 8 -bS sample.sam | \
  samtools sort -@ 8 -o sample_sorted.bam
samtools index sample_sorted.bam
rm sample.sam  # save disk space

The --dta flag ("downstream transcriptome assembly") produces output optimized for quantification tools like StringTie or featureCounts. Check your alignment log — a healthy RNA-seq experiment maps ≥ 70–85% of reads to the genome.

Aligner	Speed	Memory	Best for
HISAT2	Very fast	~4–8 GB	Standard alignment, low-memory servers
STAR	Fast	30+ GB	High accuracy, novel splice junctions, large-scale studies

// step 05

Quantification with featureCounts

Alignment gives us a BAM file. Now we need to count how many reads overlap each gene using a GTF/GFF annotation file. featureCounts (from the Subread package) is the most widely used tool for this.

bash

featureCounts \
  -T 8 \
  -p                 # paired-end
  -s 2               # stranded: 0=unstranded, 1=stranded, 2=reverse
  -a Homo_sapiens.GRCh38.109.gtf \
  -o counts_matrix.txt \
  sample1_sorted.bam \
  sample2_sorted.bam \
  sample3_sorted.bam

The output is a tab-delimited count matrix where rows = genes and columns = samples. This matrix is the direct input to DESeq2 or edgeR.

⚠️

The strandedness parameter (-s) must match your library preparation protocol. Getting it wrong silently discards 50% of your reads. If unsure, use RSeQC infer_experiment.py to detect it automatically.

Alternative: pseudo-alignment with Salmon

For very fast quantification without a genome alignment step, Salmon or Kallisto align directly to a transcriptome reference and estimate transcript-level counts in minutes. They're excellent for quick analyses but skip the BAM file entirely.

// step 06

Differential Expression with DESeq2

DESeq2 uses a negative binomial model to account for the count-based, overdispersed nature of RNA-seq data. It handles normalization (size factors), dispersion estimation, and statistical testing all in one coherent framework.

library(DESeq2)

# 1. Load the count matrix (genes x samples)
counts <- read.table("counts_matrix.txt", header=TRUE, row.names=1, skip=1)[,6:ncol(counts)]

# 2. Create a sample metadata table
coldata <- data.frame(
  row.names = colnames(counts),
  condition = c("control", "control", "treated", "treated")
)

# 3. Build DESeqDataSet object
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ condition
)

# 4. Pre-filter: remove genes with < 10 total counts
dds <- dds[rowSums(counts(dds)) >= 10, ]

# 5. Run DESeq2 — normalization + testing in one call
dds <- DESeq(dds)

# 6. Extract results: treated vs control
res <- results(dds,
  contrast   = c("condition", "treated", "control"),
  alpha      = 0.05,    # FDR threshold
  lfcThreshold = 0
)

# 7. Shrink log2FoldChanges for visualization (recommended)
res_shrunk <- lfcShrink(dds, coef="condition_treated_vs_control", type="apeglm")

# 8. Save significant results
res_sig <- subset(res_shrunk, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(as.data.frame(res_sig), "DE_genes_significant.csv")

💡

Always use adjusted p-values (padj), not raw p-values. DESeq2 applies the Benjamini-Hochberg correction by default. A padj < 0.05 means the expected false discovery rate among your significant hits is ≤ 5%.

// step 07

Interpreting Your Results

Raw DE gene lists are only the beginning. Standard visualizations help you quality-check your data and communicate findings clearly.

Volcano Plot

A volcano plot shows effect size (log2FC) on the x-axis and statistical significance (−log10 padj) on the y-axis. Genes in the upper corners are both significant and large in effect — your primary candidates.

// example volcano plot (schematic)

R — volcano plot

library(ggplot2)
library(dplyr)

df <- as.data.frame(res_shrunk) %>%
  mutate(
    sig   = padj < 0.05 & abs(log2FoldChange) > 1,
    color = case_when(
      sig & log2FoldChange > 1  ~ "Up",
      sig & log2FoldChange < -1 ~ "Down",
      TRUE                       ~ "NS"
    )
  )

ggplot(df, aes(x=log2FoldChange, y=-log10(padj), color=color)) +
  geom_point(alpha=0.5, size=1.5) +
  scale_color_manual(values=c(Up="#39d353", Down="#f78166", NS="#3d444d")) +
  geom_vline(xintercept=c(-1,1), linetype="dashed", alpha=0.5) +
  geom_hline(yintercept=-log10(0.05), linetype="dashed", alpha=0.5) +
  theme_minimal() +
  labs(title="Volcano Plot: Treated vs Control")

Heatmap of top DE genes

R — heatmap

library(pheatmap)

# Variance-stabilizing transform for visualization
vst_data <- vst(dds, blind=FALSE)

# Top 50 DE genes by padj
top_genes <- rownames(head(res_sig[order(res_sig$padj), ], 50))

pheatmap(
  assay(vst_data)[top_genes, ],
  scale = "row",
  annotation_col = coldata,
  show_rownames = TRUE,
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  color = colorRampPalette(c("#58a6ff", "white", "#f78166"))(100)
)

Downstream: Gene Ontology and Pathway Enrichment

A list of DE genes becomes meaningful through enrichment analysis. Tools like clusterProfiler, g:Profiler, or GSEA tell you which biological processes, pathways, or gene sets are over-represented in your results compared to the background.

ℹ️

Quick interpretation rules of thumb: A log2FC of 1 = 2× fold change. A log2FC of −2 = 4× down-regulation. Aim for |log2FC| > 1 and padj < 0.05 as a starting cutoff, but adjust based on your biological context.

// fin

Summary & Next Steps

You now have a complete RNA-seq pipeline from raw FASTQ files to a list of differentially expressed genes. The key stages are: QC → trimming → alignment → quantification → DE testing. Each step has sensible defaults, but understanding why each parameter exists will help you troubleshoot and adapt the workflow to your data.

From here, consider exploring: transcript-level analysis with tximeta/DEXSeq for splicing; single-cell RNA-seq with Seurat or Scanpy for cell-type resolution; or long-read RNA-seq with PacBio/Nanopore for isoform characterization.

🧬

The best sanity check: do your top DE genes make biological sense? If your experiment is a heat-shock treatment and you see HSP70/HSP90 family genes at the top of your up-regulated list — you're on the right track.