What is RNA-seq?
RNA sequencing (RNA-seq) is a high-throughput technique for measuring gene expression across an entire transcriptome. By converting RNA into complementary DNA (cDNA) and sequencing millions of short fragments, we can determine which genes are active — and by how much — in a given biological sample under specific conditions.
The central question RNA-seq answers: "Which genes are expressed differently between my two conditions?" That might be treated vs. untreated cells, diseased vs. healthy tissue, or any other comparison.
This guide uses a Linux command-line environment. Most steps assume paired-end Illumina reads and a reference genome. Single-end workflows are very similar — just drop the second file argument.
Tools you'll need
Understanding FASTQ Format
Every sequencing run produces FASTQ files — a text-based format encoding both the nucleotide sequence and a per-base quality score. Each read spans exactly 4 lines:
# Line 1: sequence identifier (starts with @)
@SRR12345678.1 1 length=150
# Line 2: raw nucleotide sequence
ACGTACGTACGTACGTACGT...
# Line 3: separator (always a +)
+
# Line 4: quality scores (ASCII encoded, same length as line 2)
IIIIIIIIHHHHGGGFFEEE...
Quality scores are Phred-encoded: the character I means Q40 (0.01% error rate), while ! means Q0 (100% error rate). The minimum acceptable average quality for most analyses is Q20–Q30.
Paired-end sequencing produces two files per sample: sample_R1.fastq.gz (forward) and sample_R2.fastq.gz (reverse). The read order must match between files!
Quality Control with FastQC
Before doing anything else, always run FastQC. It produces an HTML report with multiple quality modules so you can spot problems early — adapter contamination, low-quality tails, GC content anomalies, or PCR duplicates.
# Run FastQC on both paired-end files
fastqc -t 4 sample_R1.fastq.gz sample_R2.fastq.gz -o ./fastqc_out/
# If you have many samples, use MultiQC to aggregate all reports
multiqc ./fastqc_out/ -o ./multiqc_out/
What to look for
The most important FastQC modules for RNA-seq are:
| Module | What it tells you | Typical outcome |
|---|---|---|
| Per base seq quality | Quality score by position | Should be ≥Q28 across the read; drop at 3′ end is normal |
| Per sequence quality | Distribution of mean qualities | Spike near Q30–Q38 = healthy library |
| Adapter content | Illumina adapter contamination | Should be near 0%; if high → trim |
| Sequence duplication | PCR or optical duplicates | RNA-seq often has high duplication — usually fine unless extreme |
| GC content | Genome-level GC distribution | Should match your organism; bimodal = contamination |
Don't panic at red flags in FastQC — RNA-seq routinely fails "sequence duplication" and "per-sequence GC content" modules due to the nature of transcriptome sequencing. Context matters more than the color-coded pass/fail.
Adapter Trimming with Trimmomatic
Adapter sequences must be removed before alignment. They arise when the DNA fragment is shorter than the read length, causing the sequencer to read into the adapter ligated to the 3′ end. Trim Galore (a wrapper around Cutadapt) and Trimmomatic are the most popular options.
trimmomatic PE -threads 8 -phred33 \
sample_R1.fastq.gz sample_R2.fastq.gz \
sample_R1_paired.fastq.gz sample_R1_unpaired.fastq.gz \
sample_R2_paired.fastq.gz sample_R2_unpaired.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36
Key parameters explained: ILLUMINACLIP removes adapters with up to 2 mismatches; SLIDINGWINDOW:4:15 cuts when the average quality in a 4-base window drops below Q15; MINLEN:36 discards reads shorter than 36 bp after trimming.
After trimming, re-run FastQC on the trimmed files. You should see adapter content drop to near zero and per-base quality improve at the 3′ end.
Alignment: HISAT2 vs STAR
RNA-seq reads span exon–intron boundaries, so we need a splice-aware aligner. The two workhorses are HISAT2 (fast, lower memory) and STAR (highly accurate, higher memory). Both require a pre-built genome index.
Building a HISAT2 index
# Download genome and annotation (example: GRCh38)
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
# Build HISAT2 index (takes ~30 min, ~8 GB RAM)
hisat2-build -p 8 \
Homo_sapiens.GRCh38.dna.toplevel.fa \
./genome_index/GRCh38
Running HISAT2 alignment
hisat2 -p 8 \
-x ./genome_index/GRCh38 \
-1 sample_R1_paired.fastq.gz \
-2 sample_R2_paired.fastq.gz \
--dta \
-S sample.sam \
2> sample_hisat2.log
# Convert SAM → BAM, sort, and index
samtools view -@ 8 -bS sample.sam | \
samtools sort -@ 8 -o sample_sorted.bam
samtools index sample_sorted.bam
rm sample.sam # save disk space
The --dta flag ("downstream transcriptome assembly") produces output optimized for quantification tools like StringTie or featureCounts. Check your alignment log — a healthy RNA-seq experiment maps ≥ 70–85% of reads to the genome.
| Aligner | Speed | Memory | Best for |
|---|---|---|---|
| HISAT2 | Very fast | ~4–8 GB | Standard alignment, low-memory servers |
| STAR | Fast | 30+ GB | High accuracy, novel splice junctions, large-scale studies |
Quantification with featureCounts
Alignment gives us a BAM file. Now we need to count how many reads overlap each gene using a GTF/GFF annotation file. featureCounts (from the Subread package) is the most widely used tool for this.
featureCounts \
-T 8 \
-p # paired-end
-s 2 # stranded: 0=unstranded, 1=stranded, 2=reverse
-a Homo_sapiens.GRCh38.109.gtf \
-o counts_matrix.txt \
sample1_sorted.bam \
sample2_sorted.bam \
sample3_sorted.bam
The output is a tab-delimited count matrix where rows = genes and columns = samples. This matrix is the direct input to DESeq2 or edgeR.
The strandedness parameter (-s) must match your library preparation protocol. Getting it wrong silently discards 50% of your reads. If unsure, use RSeQC infer_experiment.py to detect it automatically.
Alternative: pseudo-alignment with Salmon
For very fast quantification without a genome alignment step, Salmon or Kallisto align directly to a transcriptome reference and estimate transcript-level counts in minutes. They're excellent for quick analyses but skip the BAM file entirely.
Differential Expression with DESeq2
DESeq2 uses a negative binomial model to account for the count-based, overdispersed nature of RNA-seq data. It handles normalization (size factors), dispersion estimation, and statistical testing all in one coherent framework.
library(DESeq2)
# 1. Load the count matrix (genes x samples)
counts <- read.table("counts_matrix.txt", header=TRUE, row.names=1, skip=1)[,6:ncol(counts)]
# 2. Create a sample metadata table
coldata <- data.frame(
row.names = colnames(counts),
condition = c("control", "control", "treated", "treated")
)
# 3. Build DESeqDataSet object
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ condition
)
# 4. Pre-filter: remove genes with < 10 total counts
dds <- dds[rowSums(counts(dds)) >= 10, ]
# 5. Run DESeq2 — normalization + testing in one call
dds <- DESeq(dds)
# 6. Extract results: treated vs control
res <- results(dds,
contrast = c("condition", "treated", "control"),
alpha = 0.05, # FDR threshold
lfcThreshold = 0
)
# 7. Shrink log2FoldChanges for visualization (recommended)
res_shrunk <- lfcShrink(dds, coef="condition_treated_vs_control", type="apeglm")
# 8. Save significant results
res_sig <- subset(res_shrunk, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(as.data.frame(res_sig), "DE_genes_significant.csv")
Always use adjusted p-values (padj), not raw p-values. DESeq2 applies the Benjamini-Hochberg correction by default. A padj < 0.05 means the expected false discovery rate among your significant hits is ≤ 5%.
Interpreting Your Results
Raw DE gene lists are only the beginning. Standard visualizations help you quality-check your data and communicate findings clearly.
Volcano Plot
A volcano plot shows effect size (log2FC) on the x-axis and statistical significance (−log10 padj) on the y-axis. Genes in the upper corners are both significant and large in effect — your primary candidates.
library(ggplot2)
library(dplyr)
df <- as.data.frame(res_shrunk) %>%
mutate(
sig = padj < 0.05 & abs(log2FoldChange) > 1,
color = case_when(
sig & log2FoldChange > 1 ~ "Up",
sig & log2FoldChange < -1 ~ "Down",
TRUE ~ "NS"
)
)
ggplot(df, aes(x=log2FoldChange, y=-log10(padj), color=color)) +
geom_point(alpha=0.5, size=1.5) +
scale_color_manual(values=c(Up="#39d353", Down="#f78166", NS="#3d444d")) +
geom_vline(xintercept=c(-1,1), linetype="dashed", alpha=0.5) +
geom_hline(yintercept=-log10(0.05), linetype="dashed", alpha=0.5) +
theme_minimal() +
labs(title="Volcano Plot: Treated vs Control")
Heatmap of top DE genes
library(pheatmap)
# Variance-stabilizing transform for visualization
vst_data <- vst(dds, blind=FALSE)
# Top 50 DE genes by padj
top_genes <- rownames(head(res_sig[order(res_sig$padj), ], 50))
pheatmap(
assay(vst_data)[top_genes, ],
scale = "row",
annotation_col = coldata,
show_rownames = TRUE,
cluster_rows = TRUE,
cluster_cols = TRUE,
color = colorRampPalette(c("#58a6ff", "white", "#f78166"))(100)
)
Downstream: Gene Ontology and Pathway Enrichment
A list of DE genes becomes meaningful through enrichment analysis. Tools like clusterProfiler, g:Profiler, or GSEA tell you which biological processes, pathways, or gene sets are over-represented in your results compared to the background.
Quick interpretation rules of thumb: A log2FC of 1 = 2× fold change. A log2FC of −2 = 4× down-regulation. Aim for |log2FC| > 1 and padj < 0.05 as a starting cutoff, but adjust based on your biological context.
Summary & Next Steps
You now have a complete RNA-seq pipeline from raw FASTQ files to a list of differentially expressed genes. The key stages are: QC → trimming → alignment → quantification → DE testing. Each step has sensible defaults, but understanding why each parameter exists will help you troubleshoot and adapt the workflow to your data.
From here, consider exploring: transcript-level analysis with tximeta/DEXSeq for splicing; single-cell RNA-seq with Seurat or Scanpy for cell-type resolution; or long-read RNA-seq with PacBio/Nanopore for isoform characterization.
The best sanity check: do your top DE genes make biological sense? If your experiment is a heat-shock treatment and you see HSP70/HSP90 family genes at the top of your up-regulated list — you're on the right track.