Enes K. Ergin

It's Alive!

2026-06-02T00:00:00+00:00

💡</span>

TLDR:</strong> This site is now built with Zola</a>, a static site generator that turns markdown into HTML in under 100ms. Custom shortcodes for callouts, sidenotes, code blocks, figures, and collapsible details. Old posts migrated from an obsolete platform with retroactive commentary added through sidenotes. Dark mode included. No JavaScript framework required.</p> </div> </div>

The branch that would not die</h2>
For months, this site existed as a branch on my machine. I would open it on weekends. Tweak the CSS. Write a shortcode. Close it again. The branch grew opinions about font weights.</p>
Today I merged it to main.</p>
The old site was on a platform I do not want to name because it is not their fault I outgrew it ◆</span> That part bothered me more than the performance. Writing a post should mean opening a file, writing, and running a command. Not fighting a WYSIWYG editor that keeps rearranging your paragraph breaks. </span> . It worked. It was also slow, locked into a database, and required logging into a browser to publish a 200-word note.</p>
I needed something I could edit in vim and deploy with a single command. Zola does both. Markdown files in, static HTML out. No database. No login screen. No build step that outlasts my attention span.</p>

Bash</div>
./build.sh   # pre-process tags, generate star colors, build site
git push     # GitHub Pages picks it up automatically</code></pre>
CS1: The entire deployment pipeline.</div> </div>
That is the whole thing. Three files to publish a post, one push to deploy it. I spent months building what amounts to a very opinionated markdown renderer with nice CSS.</p>
Shortcodes that do one thing</h2>
Zola calls them shortcodes. They are template fragments you invoke from markdown ◆</span> Block shortcodes use the bracket-percent syntax and wrap content. Inline shortcodes use the double-brace syntax. Both are Zola-specific. Other static site generators handle this differently. </span> . A file in templates/shortcodes/</code>, a parameter or two, and suddenly you have Notion-style formatting in a static site.</p>
I built callouts, sidenotes, code blocks, figures, and collapsible details. Each one because I kept wanting that thing and it did not exist. The callout gives me colored boxes with emoji icons for tip, note, warning, and danger. The sidenote floats commentary into the right margin so it does not break the reading flow. The code block adds a language badge so you know what you are looking at. Small things. They add up.</p>
The post kind system came from the same place. I wanted the blog listing to tell you what you were clicking into before you clicked. Technical deep dives get a table of contents and reading time. Short observations get neither. Opinion pieces and paper notes get their own presentation. The badge in the header and on the listing page handles the rest.</p>
Design that stays out of the way</h2>
The design brief was short: make it look like I wrote it. Not like a template got filled in.</p>
Muted backgrounds. Inter at weight 300 for body text. A teal accent that appears exactly where you need it and nowhere else. Monospace for code, metadata, and dates. 
Dark mode that is not an afterthought ◆</span> There is something absurd about spending this much time on CSS for a personal blog. And also something satisfying about getting the teal accent exactly right in both light and dark mode. </span> .</p> The site fits a lot of information into a small space without feeling crowded. That balance took the longest to get right.</p> The build produces 20 static pages and completes in 92ms. Total page weight, including fonts and CSS, is about the same as a single Medium article loaded with their JavaScript framework, analytics tracker, and cookie consent banner ◆</span> I checked. The difference is two orders of magnitude. This is not a flex. It is an observation about what we accepted as normal. </span> .</p> The old posts live here now</h2> I migrated most of the old content. Some posts did not make the cut. Unfinished drafts, notes that no longer applied, a few early pieces that did not fit what I want this site to be. The ones that stayed are organized by kind and retroactively tagged to cross-link with current projects.</p> The oldest surviving post is from 2018, a fuzzy C-means clustering write-up from my Master's ◆</span> I re-read it before migrating and added a note where my understanding of the math had changed. Young me was enthusiastic if not always precise. </span> . The content is untouched. The commentary is new.</p> Posts that made the journey</summary> Fuzzy C-Means clustering on the Iris dataset (2018)</li> Prosit spectral prediction (2019)</li> PRODA: proteoform detection (2020)</li> ProteoForge deep dive (2025)</li> Opinion pieces on Zig, vendor formats, search engines, storage (2025-2026)</li> z-fasta, mzarc, zigR, HarmonizePy, and all the Zig experiments (2026)</li> </ul> </div> </details> Each migrated post got the same treatment: convert the frontmatter, assign the right kind, add shortcodes where they improved readability. The ones that needed correction got retroactive sidenotes. 
The ones that still hold up got left alone.</p> What is next</h2> Several things I want to add before I call this finished. A comments section, if I can find one that does not require a JavaScript framework. Analytics that count visits without counting visitors. Cloudflare if I move the DNS there. A proper RSS template. Per-page meta tags for social previews. And the presentations and teaching pages need the same treatment the blog just got.</p> The blog itself will keep growing. The project page needs a compression benchmark comparison. I want to write up the SQuAPP workflow properly. There is a half-finished post about why I keep building things in Zig instead of using existing tools.</p> But mostly I want to use this thing. The site is built, the branch is merged, and the publishing pipeline is a single push away.</p> The rest is just writing.</p>

HarmonizePy: HarmonizR's Approach, in Python 2026-05-28T00:00:00+00:00 💡</span> HarmonizePy is public.</strong> Pure-Python batch correction for omics data, built around HarmonizR's core idea: handle structural missingness by grouping features by observed batch support, correcting compatible subsets, and reassembling without imputation. pip install harmonizepy</code>.</p> </div> </div> HarmonizR's idea is simple once you see it. Standard ComBat and limma need dense matrices. Real proteomics data has structured missingness (a peptide might be quantifiable in batches 1 and 2 but absent from batch 3 for reasons that have nothing to do with zero abundance). Most approaches impute first or drop features. Voß et al. ◆</span> Voß H et al. "HarmonizR enables data harmonization across independent proteomic datasets with appropriate handling of missing values." Nature Communications</em> 13:3523, 2022. </span> asked a different question: what if you let the missingness structure tell you which features should be corrected together?</p> Split the matrix by batch-presence patterns. Each feature joins only the batches that observed it. Correct those sub-matrices independently. Reassemble. No imputation.</p> </blockquote> That is the idea. Schlumbohm et al. ◆</span> Schlumbohm S, Neumann JE, Neumann P. "HarmonizR: blocking and singular feature data adjustment improve runtime efficiency and data preservation." BMC Bioinformatics</em>, 2025. </span> later added blocking and singular-feature rescue. The core approach is the same and it is worth having in Python.</p> We used HarmonizR in the lab. It worked. The friction was not the algorithm. It was reaching across languages every time. Move data to R, correct, bring it back. Once is fine. 
When the workflow needs to be shared or automated, that extra hop becomes a reason to skip the correction.</p>
HarmonizePy keeps the pipeline inside Python.</p>

Python</div>
from harmonizepy import harmonize

result = harmonize("data.tsv", "batch.csv")</code></pre>
CS1: The harmonize() entry point.</div> </div>
That is the main thing. A single call. Same shape as the input. No imputation. No R installation.</p>
The implementation was rebuilt from the method descriptions, not ported from the R source line by line ◆</span> Direct ports carry assumptions from the original language without forcing a real understanding of the algorithm. Rebuilding from the papers meant testing against expected outputs, not copying code. </span> . It uses NumPy for the core engines, has 628 tests validated against sva::ComBat</code>, limma::removeBatchEffect</code>, and HarmonizR v1.10.0, and is numerically concordant within documented tolerances.</p>
The performance works out well for pure Python ◆</span> On a Ryzen 3950X, HarmonizePy runs 54x to 266x faster than single-core HarmonizR across the benchmarked datasets. The largest gaps are on non-parametric ComBat and SCP-scale data. Details in the Benchmarks wiki page. </span> . NumPy's BLAS-backed operations avoid R's per-iteration overhead, but that was never the goal. The goal was keeping the workflow in one language.</p>
📝</span> What this changes.</strong> A script that the lab used internally is now a public package with tests, docs, a CLI, and a stable API. The work was making it independent of the original author, of the R runtime, and of the private repository, so the next person who needs to correct batch effects in Python has something to start from.</p> </div> </div>
Making it public does not mean it is finished. Diagnostics, better guidance on when correction is working, and more real-world validation are still useful. 
What it means is the implementation is open, inspectable, and usable without asking for access.</p>
That was the handoff I wanted.</p>

pxseek: From a Selenium Script to a PyPI Package

2026-05-26T00:00:00+00:00

💡</span> pxseek is on PyPI.</strong> CLI and Python library for querying ProteomeXchange metadata. Filter by species, repository, keywords, date range, instruments. Local caching, JSON and table output. pip install pxseek</code>.</p> </div> </div>
The first version was not a package. It was a small parser from 2024, built to help hunt down pediatric cancer proteomics datasets on ProteomeXchange. I worked on it with a co-op student. The goal was practical: find datasets faster, collect enough metadata to decide what was worth looking at, and avoid doing the same manual search again and again.</p>
It worked, but it relied on Selenium ◆</span> Browser automation is the kind of dependency you accept for a prototype and regret in production. Chrome versions change. Driver versions change. Pages change. A tool that helps you today can fail silently tomorrow and you will not know why until you debug the browser. </span> . I did not want the lab to depend on that long term.</p>
Before I left my post in Lange Lab, I rewrote the core to use ProteomeCentral metadata directly. No browser. No driver. Just HTTP requests and structured data.</p>

Bash</div>
pxseek fetch -o px_datasets.tsv
pxseek filter -i px_datasets.tsv -s "Homo sapiens" -k "cancer" -o shortlist.tsv
pxseek lookup --input shortlist.tsv -o detailed.tsv</code></pre>
CS1: Fetch, filter, and lookup in three commands.</div> </div>
The rewrite also added local caching, multiple output formats, a Python API, and a CLI that works in scripts. It does not download raw files or process spectra. It finds metadata so you can decide what to download.</p>
Publishing to PyPI changes what the tool is. Instead of "there is a script somewhere," it becomes pip install pxseek</code>. That matters for documentation. 
It matters for the next student, analyst, or lab member who needs to ask: "Which human cancer proteomics datasets are available?"</p> 📝</span> Why this mattered.</strong> I wanted to leave behind something the lab could keep using without needing me to explain a private script, a browser setup, or a set of manual steps. Small tools that sit between a research need and a little software care are worth packaging properly.</p> </div> </div> That is what the PyPI release is. A fragile script turned into something installable, repeatable, and independent of whoever wrote the first version.</p>Native R Extensions with Zig: Initial Observations 2026-05-23T00:00:00+00:00 People who know me are aware there is no great love lost between me and R. The language has quirks that are genuinely baffling to anyone arriving from nearly anywhere else: 1-indexed vectors, factors that quietly convert to integers at the wrong moment, a data.frame that is simultaneously a list and not a list depending on which function you are calling, and a scoping model that surprises you on a regular basis.</p> And yet I keep writing R packages. The tidyverse changed what it means to do data work in R. dplyr</code>, ggplot2</code>, the whole pipe-based philosophy, and the serious ongoing work on condition handling and user-facing error messages across the ecosystem. That side of R is genuinely well-designed. A community has built something worth using on top of a language that often fights you, and that deserves credit.</p> I also have real respect for where Rust is going. The ownership model is the most coherent answer I have seen to the problems that accumulated in C++: manual memory management buried under decades of abstraction, undefined behavior that compilers exploit rather than catch, and a template system that grew into something most people use but few people fully understand. Rust solves those problems correctly. The borrow checker is not inconvenient overhead. 
It is the point.</p> So when I started writing R extensions seriously, I tried both directions. And I kept running into the same friction.</p> Why Zig</h2> When you write an R extension, regardless of language, you end up at the same place: the R C API. SEXP values. PROTECT and UNPROTECT calls. .Call</code> as the entry point. R's garbage collector owns your objects and decides when they die. You do not own them. Your extension borrows them from R for the duration of a function call, and then R reclaims them.</p> This is where Rust's ownership story stops buying you what it normally buys. The borrow checker is built for a world where your code owns memory and you need to prove that ownership transfers correctly. In R extensions, R's GC is the owner. Your code is always a borrower. Both extendr</a> and Savvy</a> handle this through generated wrappers and careful lifetime management, and they do it well. But you are carrying real framework weight to solve a problem that is fundamentally about R's memory model, not Rust's.</p> 📝</span> This is not a criticism of extendr or Savvy. Both are serious projects and the benchmark results below show it. Savvy in particular is strong on BLAS and linear model work. The point is about fit: if you are already working at the C ABI boundary, a language that treats C interop as a first-class citizen with zero overhead has a natural advantage for this specific use case.</p> </div> </div> Zig treats C interop as a language feature, not a compatibility mode. You can call C functions directly, import C headers with @cImport</code>, and build a shared library that R loads with dyn.load()</code> exactly the way it loads a compiled C extension. No wrapper generation step. No framework between you and the R C API. Just Zig, compiling tight code to the same ABI that .Call</code> has always expected.</p> Zig also gives you comptime instead of macros, explicit allocators, no hidden control flow, and no hidden allocations. 
Every allocation is visible. Every error path is explicit. The compile-time guarantees catch things that would be silent bugs in C without requiring the ownership ceremony that Rust demands in a context where R is already the owner.</p>
That is what zigr</a> is built toward.</p>
The Benchmark Harness</h2>
To test whether the fit translated into actual performance, I built a harness across six backends and twenty-three tasks. ◆</span> The benchmark code is not yet public. It will be released alongside zigr when the harness stabilizes. All six backends implement the same 23 tasks with the same data shapes and operation types. </span> </p>
The six backends:</p>
Runner</th> Implementation</th></tr></thead>
r</code></td> Pure R reference baseline</td></tr>
zigr</code></td> Native Zig via zigr</td></tr>
c_call</code></td> Hand-written C via .Call</code></td></tr>
rcpp</code></td> C++ via Rcpp</td></tr>
extendr</code></td> Rust via extendr-generated wrappers</td></tr>
savvy</code></td> Rust via Savvy-generated wrappers</td></tr>
</tbody></table>
All six currently pass all 23 shared tasks. Results were refreshed on 2026-05-23.</p>
What the Numbers Say</h2>
The runner summary first. Lower geomean is better. The Geomean vs R</code> column is the geometric mean of mean_ms / R_baseline_mean_ms</code> across all tasks, so values below 1.0x</code> are faster than R overall.</p>
Runner</th> Tasks Won</th> Geomean vs R</th></tr></thead>
zigr</code></td> 10</td> 0.475x</td></tr>
Savvy</code></td> 3</td> 0.648x</td></tr>
extendr</code></td> 2</td> 0.695x</td></tr>
Rcpp</code></td> 4</td> 0.891x</td></tr>
C .Call</code></td> 2</td> 0.935x</td></tr>
R</code></td> 2</td> 1.000x</td></tr>
</tbody></table>
Benchmark summary</div> Figure 1</div> </div> </div>
Figure 1. Geometric mean of mean_ms / R_baseline_mean_ms across 23 tasks. Values below 1.0x are faster than the R baseline. zigr leads at 0.475x.</p> </figcaption> </div> </figure>
zigr leads overall and wins the most tasks. 
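For anyone who wants to recompute the Geomean vs R column from per-task means, here is a minimal sketch. It is not the harness code, which is not public yet. The function name `geomean_vs_baseline` and the dict layout are assumptions made for illustration; the three sample rows are copied from the zigr and R columns of the full results table.

```python
import math

def geomean_vs_baseline(runner_ms, baseline_ms):
    """Geometric mean of runner_ms[task] / baseline_ms[task] over the baseline's tasks.

    Hypothetical helper for illustration, not part of zigr or the harness.
    """
    ratios = [runner_ms[task] / baseline_ms[task] for task in baseline_ms]
    # Average the logs, then exponentiate: numerically safer than multiplying ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Three of the 23 tasks, mean wall time in ms, taken from the full results table.
r_ms = {"05_dataframe": 21.9324, "09_protect": 0.0016, "13_lm": 1.1645}
zigr_ms = {"05_dataframe": 0.5260, "09_protect": 0.7407, "13_lm": 0.3996}

print(f"{geomean_vs_baseline(zigr_ms, r_ms):.3f}x")
```

On this three-task subset the result comes out above 1.0x, because the PROTECT stress ratio dominates. A geomean is only as representative as the task mix behind it, so the 23-task figure is the one to read.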
The geomean hides the actual story though, so here is the breakdown by category.</p> Vector, dataframe, and reduction work.</strong> This is where zigr separates most clearly. 05_dataframe</code> (filtering 1e5</code> rows) runs in 0.53 ms in zigr against 21.9 ms in R. 06_na_prop</code> (NA propagation across 1e6</code> values) is 0.15 ms in zigr against 9.7 ms in R. 14_rowsums</code> is 0.08 ms against 1.25 ms. These are not marginal wins. The work Zig does here is tight iteration over contiguous memory with no interpreter overhead, and R's overhead on large vectorized operations is real and measurable.</p> BLAS and linear algebra.</strong> Savvy is the surprise here. 10_blas_matmul</code> (256x256) runs in 1.18 ms for Savvy against 2.63 ms for zigr. Linear model fitting (13_lm</code>, n=5000, p=20) is 0.35 ms for Savvy against 0.40 ms for zigr. Both are calling into the same BLAS and LAPACK routines underneath. Savvy's generated wrapper appears to hit a faster dispatch path for this class of work, and understanding why is on the list.</p> Classic native extension work.</strong> Fibonacci, sort, random normal, element-wise ops. Rcpp and hand-written C are within noise of each other on most of these, and zigr is not dramatically ahead. This is expected: at this level all backends are bottlenecked on the same libm and sorting routines. The extension language stops mattering when the real work happens in a library call.</p> The two anomalies.</strong> 17_broadcast</code> (vector + scalar, 1e7</code>) and 19_cumsum</code> (cumulative sum, 1e7</code>) are tasks where zigr does not clearly separate from R. The numbers are 47 ms vs 49 ms on broadcast, and 50 ms vs 59 ms on cumsum. ◆</span> These two deserve their own investigation. Both operate on large contiguous vectors with simple arithmetic, which is exactly where you would expect Zig to pull ahead. 
Cache behavior or differences in auto-vectorization between R's bytecode compiler and Zig's codegen are the likely candidates. No clean answer yet. </span> </p> The two tasks R wins.</strong> 09_protect</code> (PROTECT stress, 49k objects) and 23_altrep_sum</code> (ALTREP sum over 1:1e7</code>). These are worth reading carefully rather than dismissing. ◆</span> I said at the top there is no great love lost between me and R. I am willing to say when I am wrong. Some base R internals are exceptionally well-tuned. The R team has spent real effort on the ALTREP sum path. An ALTREP vector carries metadata that lets operations like sum()</code> short-circuit to a formula rather than iterate. My zigr implementation does not exploit that yet. This is not a loss worth engineering around for v0.0.7. </span> </p> PROTECT stress measures call overhead more than computation. A small surface is being called many times and the overhead of entering and exiting the Zig shim accumulates. R wins here because there is no shim. 
That will always be true for micro-benchmarks of this shape.</p>
Full results: mean wall time in ms, all 23 tasks</summary>
Lower is better.</p>
Task</th> R</th> zigr</th> C .Call</th> Rcpp</th> extendr</th> Savvy</th></tr></thead>
01_fib</td> 0.0038</td> 0.0021</td> 0.0021</td> 0.0017</td> 0.0042</td> 0.0019</td></tr>
02_vectorsum</td> 15.5244</td> 3.5422</td> 9.4884</td> 9.3098</td> 8.9240</td> 8.8798</td></tr>
03_transpose</td> 1.1433</td> 0.9474</td> 0.9850</td> 0.9755</td> 0.9957</td> 1.0388</td></tr>
04_strings</td> 2.2369</td> 1.1169</td> 1.2217</td> 0.9208</td> 0.9038</td> 0.8992</td></tr>
05_dataframe</td> 21.9324</td> 0.5260</td> 0.9870</td> 0.9909</td> 1.0147</td> 0.9692</td></tr>
06_na_prop</td> 9.6852</td> 0.1517</td> 2.7801</td> 2.7209</td> 2.8617</td> 2.8829</td></tr>
07_parallel</td> 15.5052</td> 2.2411</td> 2.8719</td> 2.7538</td> 2.5494</td> 2.5613</td></tr>
09_protect</td> 0.0016</td> 0.7407</td> 0.8232</td> 0.7896</td> 0.9051</td> 0.8812</td></tr>
10_blas_matmul</td> 2.7278</td> 2.6343</td> 2.9852</td> 2.9449</td> 3.8780</td> 1.1761</td></tr>
11_crossprod</td> 0.0728</td> 0.0677</td> 0.0668</td> 0.0715</td> 0.0681</td> 0.0689</td></tr>
12_cholesky</td> 0.7360</td> 0.5001</td> 0.6791</td> 0.5531</td> 1.0418</td> 0.5954</td></tr>
13_lm</td> 1.1645</td> 0.3996</td> 0.4264</td> 0.3907</td> 0.3922</td> 0.3506</td></tr>
14_rowsums</td> 1.2502</td> 0.0797</td> 0.1753</td> 0.1746</td> 0.1154</td> 0.1213</td></tr>
15_elem_ops</td> 86.8469</td> 31.1790</td> 30.4700</td> 30.0600</td> 33.1929</td> 30.4818</td></tr>
16_rowcol_means</td> 1.9085</td> 0.4686</td> 0.8153</td> 0.8132</td> 0.8161</td> 0.8177</td></tr>
17_broadcast</td> 48.9904</td> 47.2191</td> 48.1040</td> 48.4642</td> 94.5069</td> 96.0987</td></tr>
18_sort</td> 62.7667</td> 41.6258</td> 36.0792</td> 36.6330</td> 38.1290</td> 36.2273</td></tr>
19_cumsum</td> 58.8522</td> 50.3242</td> 51.5503</td> 51.5376</td> 121.2041</td> 104.3952</td></tr>
20_rnorm</td> 38.1920</td> 32.1525</td> 32.3429</td> 32.1031</td> 32.1622</td> 32.5613</td></tr>
21_string_nchar</td> 0.6656</td> 0.0817</td> 0.0832</td> 0.0811</td> 0.0659</td> 0.1310</td></tr>
22_which_na</td> 4.0595</td> 1.3015</td> 2.2631</td> 2.1946</td> 1.1840</td> 2.4138</td></tr>
23_altrep_sum</td> 0.0019</td> 0.0024</td> 8.9877</td> 9.0244</td> 0.0025</td> 0.0023</td></tr>
24_altrep_read</td> 0.0021</td> 0.0022</td> 0.0023</td> 0.0019</td> 0.0023</td> 0.0022</td></tr>
</tbody></table> </div> </details>
Beyond Speed</h2>
Wall time is the easiest metric to collect and the easiest to misread alone. Three more dimensions are planned for the next harness version.</p>
Memory.</strong> Peak RSS during each task. Heap bytes allocated, tracked through a custom allocator or mallinfo</code>. Leak detection: run each task 10,000 times in one session and watch the heap. GC pressure: how many GC cycles does each backend trigger? An extension that causes fewer allocations means fewer pauses in long-running R sessions even when individual call times look fast. ◆</span> GC pressure is the dimension where Zig's explicit allocator model has the most interesting potential. You can choose not to heap-allocate for many operations. Whether that translates into measurably lower GC interruption in practice is something the harness will measure directly. </span> </p>
Safety.</strong> Crash resilience: what happens when an extension receives a NULL SEXP, the wrong SEXPTYPE, a vector with the wrong length, or a negative index? Does the R process crash or recover cleanly? Error propagation: does the extension call Rf_error()</code> and signal a proper R condition, or does it abort()</code>? R should survive an extension error without losing the session. PROTECT audit: static analysis using rchk</code></a>, which specifically checks PROTECT/UNPROTECT discipline in R extension C code. Passing rchk</code> clean is harder than it looks, and most extensions do not.</p>
Developer experience.</strong> Compile time from a clean build. 
Binary size of the resulting .so</code>. Both matter for CRAN distribution where size limits are real and cold installs happen regularly. Cold start: dyn.load()</code> time in a fresh R session. Long-run stability: p99 latency over 50,000 task repetitions to surface latency spikes from background GC or OS scheduler interference.</p> The goal is a full four-axis comparison matrix: speed, memory, safety, and developer cost. The harness goes public when that picture is complete enough to be useful.</p> Design Constraints</h2> zigr is built around three explicit refusals.</p> It does not try to cover every SEXP type. The surface is narrow by intention: numeric vectors, matrices, dataframes, strings, and the foreign function call itself. That covers what appears in real scientific computing extensions. Everything else waits for a concrete use case.</p> It does not abstract the C ABI. The build artifact is a shared library that dyn.load()</code> treats identically to a compiled C extension. The call goes from R through .Call</code> directly into Zig. No generated wrapper layer. No code to read through when something goes wrong.</p> It does not grow for completeness. New types and operations go in when they are needed, not because coverage is a goal in itself. The narrow scope and the PROTECT safety guarantees are related: the smaller the surface, the more completely those guarantees can cover it.</p> Closing</h2> A lightning rod does two things. It carries the strike, and it keeps the structure standing.</p> That is what zigr is trying to be for R extension work. Bring the speed. Stop the bugs from burning the house down.</p> The benchmark harness will go public alongside the library. v0.0.7 is a proof of concept. The numbers suggest the direction is right.</p> QuEStVar v0.1.0 is out 2026-05-22T00:00:00+00:00 💡</span> pip install questvar[plot,yaml]</code></p> PyPI</a> · Docs</a> · GitHub</a> · Paper</a></p> </div> </div> The package grew out of the 2024 J. Proteome Res. 
paper</a>. The analysis lived in a monolithic GitHub archive until last week. v0.1.0 is the clean API extraction of that work.</p>
The core point is one that keeps getting missed in omics: a non-significant t-test is not evidence of equivalence. It is a failure to reject. QuEStVar runs a TOST (two one-sided t-tests)</a> alongside the standard test so the result is always one of three states: differential, equivalent, or genuinely indeterminate. The indeterminate case is important to surface explicitly rather than letting it collapse into "not significant."</p>
The other thing I spent time on is the exclusion tracking. Features excluded before testing (bad CV, missing values, zero variance) appear in their own panel with a breakdown by reason. Most tools drop them silently. Knowing why</em> a feature was excluded matters when you are interpreting the overall result counts.</p>

Python</div>
import polars as pl
from questvar import QuestVar

df = pl.read_csv("data/demo_realistic.tsv", separator="\t")
qv = QuestVar(cv_thr=1.0, eq_thr=0.5, df_thr=1.0, p_thr=0.05, correction="fdr")
results = qv.test(df, cond_1=["c1_0","c1_1","c1_2"], cond_2=["c2_0","c2_1","c2_2"])
print(results.summary())</code></pre>
CS1: Minimal single-comparison run.</div> </div>
Power analysis is included from the start because sample size planning under equivalence constraints is different from planning under difference constraints, and I kept getting that question after the paper.</p>
The .plot()</code> call on any results object produces an eight-panel figure. Panel G is the exclusion breakdown.</p>
QuEStVar v0.1.0</div> Figure 1</div> </div> </div>
Figure 1.</strong> Summary plot from a single comparison.</p> </figcaption> </div> </figure>
v0.1.0 is single-comparison only. Multi-comparison support (metadata-driven pair generation, batch execution) is next.</p>

z-toml Is Stable and Passing

2026-05-20T00:00:00+00:00

💡</span> z-toml v0.4.0.</strong> TOML v1.1 parser and writer for Zig 0.16. 
Struct mapping and raw access, in-place rewriter, passes the toml-test corpus. Small enough to understand, stable enough to depend on.</p> </div> </div>
z-toml has reached the point where I am comfortable calling it stable for my own use. That does not mean finished forever. It means the main shape is there, the important pieces are tested, and I can start depending on it without feeling like every small change may break the whole design.</p>

Zig</div>
const src =
    \\title = "My App"
    \\[server]
    \\port = 8080
;
const root = try toml.parseSlice(gpa, src, null);
defer toml.deinit(root, gpa);
const port = root.get("server").?.table.get("port").?.integer.value;</code></pre>
CS1: Parse a TOML string and access values.</div> </div>
I started z-toml because I wanted a TOML library I could understand, maintain, and use inside other Zig projects. I did not want configuration parsing to become a fragile side problem every time I worked on something that needed config files.</p>
The main milestones: it parses TOML v1.1, writes TOML back out, supports typed parsing into Zig structs, and has an in-place rewriter for changing values without rebuilding the whole file ◆</span> The in-place rewriter is the feature I use most. It preserves comments and formatting for the parts of the file it does not touch, which means I can update a config value programmatically without blowing away the hand-written annotations around it. </span> . It also passes the toml-test validation suite ◆</span> toml-test is the official TOML validation corpus maintained by the TOML project. Passing it means the parser handles edge cases like inline tables, array-of-tables, dotted keys, and the various datetime formats correctly. It is the closest thing to a conformance guarantee a TOML library can have. </span> .</p>
The library is starting to feel boring in the good way. I can parse a file, map it into a struct, write it back out, or rewrite one value by path. The tests pass. 
The behavior is documented enough that future me should not have to rediscover the whole project from scratch.</p> </blockquote> 📝</span> What is next.</strong> Better documentation, a small CLI, typed serialization, more examples. The library has crossed the line from experiment to usable dependency. The rest is polish, not discovery.</p> </div> </div> The project is small and it should stay small. It is not trying to become a full configuration framework. It parses, writes, preserves enough formatting for practical edits, and reports useful errors with line and column information. That is enough for now.</p> zebrac: Why I Forked poop 2026-05-05T00:00:00+00:00 zebrac is my fork of poop, Andrew Kelley's Linux benchmarking tool written in Zig. Andrew also created Zig, which is worth saying out loud because none of this exists without the language. I want to be clear about that from the start. poop already does the main interesting thing: it reports CPU performance counters and memory information for command benchmarks.</p> I did not fork it because I thought the original missed the point. I forked it because I started needing workflow features while comparing versions of my own tools.</p> 💡</span> What zebrac adds on top of poop.</strong> JSON output, configurable sample counts, warmup iterations, failure tolerance, better error messages, multi-architecture CI, and a cleaned-up build system. The core is still poop's. The additions are the parts I needed day to day.</p> </div> </div> The first thing I wanted was JSON output. Terminal output is good when I am watching the benchmark run. JSON is better when I want to save results, inspect them later, or feed them into a script. I do not want to manually copy numbers from the terminal every time I benchmark something.</p> I added warmups because the first few runs of a command can be strange. Cold caches, unloaded files, initial allocations. The first measurement is rarely representative. 
A --warmup</code> flag lets me discard those early iterations.</p> The other flags came from real use. --min-samples</code> and --max-samples</code> control how many measurements to collect. --allow-failures</code> lets me benchmark commands that sometimes return non-zero. Better error messages surface when a command cannot be executed at all instead of failing silently ◆</span> The error message improvements seem small, but they matter when you are benchmarking across a dozen command variants and one of them has a typo. You want to know which one failed and why, not guess. </span> .</p> Bash</div>
zebrac --json results.json \
  --warmup 3 --min-samples 5 \
  './my_tool --input data.bin'</code></pre> CS1: Typical zebrac invocation.</div> </div> The build and CI side also got attention. zebrac tracks Zig through 0.12.0, 0.15.0, and now 0.16.0. The CI cross-compiles for x86-linux, x86_64-linux, aarch64-linux, and riscv64-linux. The build.zig</code> supports -Dstrip</code> and zig build release</code> for multi-target binary releases. None of this changes the benchmark output, but it means the tool can live where the data lives, not just on my machine.</p> There is more I want to do. The current version works for my workflow, but the gaps are visible.</p> I want better shell quoting support. Passing a command with pipes, redirects, or quoted arguments to a benchmarking tool is surprisingly fragile. The shell, the flag parser, and the command string all interact in ways that break on realistic input. I want zebrac to handle the common cases without wrapping everything in a script.</p> I want better support for comparing two or more commands side by side. Right now I run them separately and compare JSON outputs by hand. A built-in comparison mode would make the workflow tighter.</p> I want cleaner export formats. Markdown tables, CSV, maybe something that pastes cleanly into a project README or a release note.</p> I want baseline comparisons. 
Run a benchmark today, save it as a baseline, run again next week, and see what changed.</p> I also want better labels and result organization. When you are benchmarking five variants of the same tool across three inputs, keeping the results straight is its own problem.</p> 📝</span> What zebrac is not.</strong> It is not a replacement for hyperfine or a general benchmarking framework. It is a focused Linux tool built around performance counters. The additions I want to make are about workflow, not scope. I am trying to make the tool I reach for when I need to know whether my code got faster or slower.</p> </div> </div> The credit belongs to poop for proving that this could be a small Zig tool built around Linux performance counters. zebrac keeps that starting point, then adds the pieces I wanted for my own benchmarking. The next pieces are the ones I still reach for manually.</p> And thank you to Andrew for Zig. It makes projects like this one fun to build, and that matters.</p> The Month My Body Forced a Priority Review 2026-04-26T00:00:00+00:00 📝</span> Personal notes rarely appear here unless they connect to work, research, or something I built. Some things should stay private. This month was different. It forced me to reconsider what I was treating as important.</p> This is not meant to be a LinkedIn inspirational post, but a small keepsake of my thought process.</strong></p> What follows is what happened, what it made me think about, and why it changed how I look at the next part of my life.</p> </div> </div> The month that interrupted the plan</h2> I was supposed to be in transition mode. My current position was ending at the end of May, and interviews were moving in different directions. Some were early. Some were close to final. Data scientist roles, bioinformatician roles, even a few software engineering roles.</p> I wanted to keep my options open.</p> Then something unexpected happened. 
I will not share the medical details here, but I had to stay in the hospital for 16 days. It was uncomfortable, tiring, and much longer than expected.</p> A hospital stay also gave me too much time to think.</p> I was away from my normal routine. Work stopped. Planning stopped. A lot of the things that felt urgent before suddenly had to wait, because my body was the thing that needed attention first.</p> Fear was driving more than I realized</h2> Before all of this, I was worried about finding a job. That is a normal thing to worry about, especially when a position is ending and the next step is not fixed yet.</p> I was applying broadly because I did not want to miss anything. The strategy was flexibility. I was telling myself that flexibility was the smart thing to do.</p> But I did not realize that some of it was due to fear.</p> Being stuck in a hospital bed made that easier to see. I had been so focused on not losing momentum that I stopped asking what kind of momentum I actually wanted.</p> </blockquote> Options mattered, but not every option was connected to the kind of life or work I care about.</p> That is hard to admit, but it is probably true.</p> I still care about making good things</h2> Being sick did not make me care less about work. It made me more honest about the kind of work I want.</p> I still care about making things that work. Useful software, meaningful science, and tools that help people do better work. 
Proteomics is where my expertise is, and the problems still feel worth the effort.</p> I do not want to drift into work that only looks good from the outside.</p> Work arrangements are not all the same to me:</p> Working from home is comfortable</li> Being close to the people I love matters</li> Having my own space, my own routine, maybe a cat nearby while I work matters</li> </ul> These are not small things to me.</p> After this month, I have less patience for onsite or hybrid requirements that seem to exist mostly for monitoring productivity or justifying office space. Not every job can be remote. Teams have different needs. But my health, comfort, and daily environment matter more than I was letting myself admit.</p> I do not want to work under duress</h2> I want to value happiness more than making money under duress. That sounds simple, but it is not simple when there are bills, uncertainty, and pressure to make the safest possible choice.</p> I still need to be practical.</p> At the same time, I do not want fear to choose everything for me:</p> I do not want to spend my energy on work that makes me unhappy if I have another choice</li> I do not want to work for a company that takes open-source projects, builds a product on top of them, and then slowly makes the product worse for the people who depended on the original work</li> I do not want to do all of this only after I am exhausted, scared, or unhappy</li> </ul> Maybe that sounds idealistic. Maybe it is.</p> I have too many ideas and opinions, and too little time. I want to keep building open-source tools. I want to keep writing. I want to keep contributing to science in a way that feels useful and honest.</p> That is the part that keeps coming back.</p> It made me think about fragile things</h2> This month made me think about the people I love. It made me think about how quickly normal life can be interrupted.</p> I do not want to turn sickness into a lesson too neatly. 
I do not think everything needs a clean meaning. Being sick was bad. Being a medically interesting case was not fun. Losing control over time, body, and plans was frustrating.</p> It still changed how I think.</p> I do not want my work to require me to ignore my life. I do not want ambition that only works when nothing goes wrong. I do not want to make plans that treat health, loved ones, and peace as things I can postpone until later.</p> </blockquote> I have done some of that before.</p> I do not want to keep doing it.</p> What I am taking from it</h2> I am not quitting ambition. I am trying to be more careful with it.</p> I still want to work hard. I still want to build things. I still want to contribute to science and open-source work. I still want to have a career that I can be proud of.</p> I just want those things to fit inside a life that I actually want to live.</p> That may mean being more selective. It may mean saying no to some options even if they look good from the outside. It may mean choosing the path that gives me more peace, more time near the people I love, and more room to make things that matter to me.</p> I do not know exactly what that looks like yet. But I know this:</p> I do not want fear to be the only reason I choose.</p> </blockquote> mzarc v0.0.1: Can Domain-Specific Compression Beat General-Purpose Codecs on Mass Spectra? 2026-03-16T00:00:00+00:00 📝</span> TLDR:</strong> mzarc is an early prototype of a domain-specific compression codec for mass spectrometry data in Zig. On one DDA dataset, lossless .mzv1</code> compresses to 27.89 MiB from 75.55 MiB mzML, beating gzip and trailing mzMLb by 11.64 MiB. Lossy at q=4096 hits 13.17 MiB with 0.218% p95 relative intensity error. Decode throughput is 167 MiB/s, roughly 27x faster than mzMLb. This is v0.0.1. One dataset. Scalar codec only. 
The remaining lossless size is almost entirely the exact m/z stream.</p> </div> </div> Why this exists</h2> I have been thinking about compression for mass spectrometry data since before I started learning Zig. The opinion pieces I wrote over the past few months keep circling the same idea: storage is an unaccounted cost, open formats inflate file sizes, and the tools to fix it exist but nobody adopts them.</p> At some point I decided to stop writing about it and start building.</p> mzarc is a question expressed as a codebase: can a codec that knows about mass spectra beat general-purpose compressors, and can it do it with decode speed fast enough to feed a search engine directly?</p> This is v0.0.1. The answer so far is partial. I am sharing it because the partial answer is interesting, and because the honest version of an early experiment is more useful than a polished launch.</p> What this is</h2> mzarc is a domain-specific, asymmetric compression codec for mzML-derived mass spectrometry spectra. Asymmetric means encode can be slow. Decode must be fast.</p> The pipeline right now is deliberately narrow:</p> Bash</div> mzML -> Python dump tool -> flat binary dump -> Zig codec -> .mzv1 file</code></pre> CS1: Ingestion pipeline.</div> </div> Python handles mzML ingestion once. Zig handles the transform and decode repeatedly. This keeps XML parsing out of the codec work. The binary dump is an internal handoff format, not a format anyone should use directly.</p> The codec stack itself is four scalar transforms composed in sequence:</p> Quantize:</strong> m/z to fixed-point; intensity to log-scale at configurable q.</li> Delta:</strong> intra-spectrum delta coding on sorted m/z arrays.</li> FOR bitpack:</strong> frame-of-reference packing with per-spectrum bit widths.</li> Block:</strong> 128 spectra per block, CRC32 validated, MS1 and MS2 in separate block streams.</li> </ol> ◆</span> 128 spectra per block fits roughly 14 KB at typical peak density. 
That sits inside L1 cache, which matters for decode throughput. </span> That is the whole thing. No entropy coding yet. No SIMD. No cross-spectrum delta. Those come later, if the scalar baseline justifies them.</p> What I measured</h2> One dataset: 15HCD_1 ◆</span> From PXD075509. 9001 spectra, 2,668,458 total peaks, 917 MS1, 8084 MS2. DDA acquisition on a Thermo instrument. </span> . One machine. Ten repeat runs per operation. Benchmarked against mzMLb, MScompress, gzip, and zstd. All tools got the same dump as input so the comparison isolates the codec, not the XML parsing.</p> Size</h3> size</div> Figure 1</div> </div> </div> File size comparison on 15HCD_1. Lossless mzv1 (27.89 MiB) beats gzip (19.82 MiB) and the dump itself (30.78 MiB). Lossy at q=4096 (13.17 MiB) is smaller than mzMLb (16.25 MiB).</p> </figcaption> </div> </figure> Artifact</th> Size</th> vs mzML</th> vs dump</th></tr></thead> :---</td> ---:</td> ---:</td> ---:</td></tr> mzML</td> 75.55 MiB</td> 100%</td> 245%</td></tr> dump (binary flat)</td> 30.78 MiB</td> 41%</td> 100%</td></tr> mzv1 lossless</strong></td> 27.89 MiB</strong></td> 37%</strong></td> 91%</strong></td></tr> mzv1 lossy q=4096</strong></td> 13.17 MiB</strong></td> 17%</strong></td> 43%</strong></td></tr> gzip dump</td> 19.82 MiB</td> 26%</td> 64%</td></tr> zstd dump</td> 17.85 MiB</td> 24%</td> 58%</td></tr> mzMLb</td> 16.25 MiB</td> 22%</td> 53%</td></tr> MScompress</td> 21.63 MiB</td> 29%</td> 70%</td></tr> </tbody></table> The lossless path clears a low bar: it beats gzip and is smaller than the internal dump itself. That means the transforms are doing real work, not just shuffling bytes. It trails zstd on the dump and mzMLb, which is expected. mzMLb uses HDF5 with blosc:zstd compression. mzarc v0.0.1 uses scalar FOR bitpacking with no entropy coding. The gap is roughly 11.64 MiB, and most of it is in one place ◆</span> The m/z stream. 17.50 MiB of the 27.89 MiB lossless file. 
That single stream is larger than the entire mzMLb file (16.25 MiB). If entropy coding can cut it in half, the lossless path wins. </span> .</p> Where the bytes go</h3> Stream</th> Lossless</th> Lossy q=4096</th></tr></thead> :---</td> ---:</td> ---:</td></tr> Structural</td> 0.04 MiB (0.1%)</td> 0.04 MiB (0.3%)</td></tr> Spectrum metadata</td> 0.17 MiB (0.6%)</td> 0.17 MiB (1.3%)</td></tr> m/z stream</td> 17.50 MiB (62.8%)</td> 9.17 MiB (69.6%)</td></tr> Intensity stream</td> 10.18 MiB (36.5%)</td> 3.80 MiB (28.8%)</td></tr> </tbody></table> The m/z stream is 17.50 MiB in lossless mode. That is 62.8% of the total file. The intensity stream shrinks from 10.18 MiB to 3.80 MiB under lossy quantization, exactly as designed. The m/z stream barely moves between lossless and lossy because the current quantizer preserves m/z exactly in both paths. Fixing this requires either lossy m/z quantization with controlled bounds, or an entropy coding layer that compresses the delta-encoded m/z residuals. Both are on the list. Neither is in v0.0.1.</p> Throughput</h3> throughput</div> Figure 2</div> </div> </div> Throughput in MiB/s. mzv1 encode and decode both exceed 120 MiB/s. mzMLb decode (6.2 MiB/s) is 27x slower.</p> </figcaption> </div> </figure> Operation</th> Throughput</th> Time</th></tr></thead> :---</td> ---:</td> ---:</td></tr> mzv1 lossless encode</td> 123.7 MiB/s</td> 0.25s</td></tr> mzv1 lossless decode</td> 167.2 MiB/s</td> 0.18s</td></tr> mzv1 lossy encode</td> 128.9 MiB/s</td> 0.24s</td></tr> mzv1 lossy decode</td> 167.9 MiB/s</td> 0.18s</td></tr> mzMLb encode</td> 3.4 MiB/s</td> 22.5s</td></tr> mzMLb decode</td> 6.2 MiB/s</td> 5.0s</td></tr> MScompress encode</td> 105.6 MiB/s</td> 0.72s</td></tr> MScompress decode</td> 4.3 MiB/s</td> 7.2s</td></tr> zstd dump decode</td> 579.7 MiB/s</td> 0.05s</td></tr> </tbody></table> mzv1 decode at 167 MiB/s is 27x faster than mzMLb decode. 
It is slower than zstd on the dump, which is expected: zstd has years of SIMD-optimized C. The scalar Zig codec has none. The question is whether the gap closes with entropy coding and SIMD, or whether general-purpose compressors will always be faster on decode. I do not know yet.</p> MScompress decode at 4.3 MiB/s is the slowest decode path in the benchmark. Its threaded encode hits 587 MiB/s, which is impressive, but the asymmetry is in the wrong direction for a format designed to be decoded many times.</p> How to reproduce these benchmarks</summary> Run from the repository root:</p>
uv run python tools/benchmark_v1.py \
  --repeats 10 \
  --external-baselines mzmlb,mscompress \
  --mscompress-benchmark-threaded \
  data/PXD075509/15HCD_1.mzML
</code></pre> Output goes to benchmark/report.json</code> and benchmark/report.md</code>. Plots land in benchmark/plots/</code>. The data shown here is from commit 93459bd5</code>.</p> </div> </details> Fidelity</h3> fidelity</div> Figure 3</div> </div> </div> Fidelity overview. Lossless paths are exact. Lossy q=4096 shows controlled intensity error with m/z error at the ppm level.</p> </figcaption> </div> </figure> Lossless mzv1 round-trips exactly. Every m/z value and every intensity survives encode-decode unchanged ◆</span> This is tautological by design. A lossless codec that changed data would be a bug. The claim matters only because it separates the codec correctness from the quantization question, which is where the interesting tradeoffs live. </span> . So does the original scan order.</p> Lossy at q=4096: max absolute m/z error is 1.0e-06 Da. Mean absolute intensity error is 695 (raw counts). P95 relative intensity error is 0.218%. P99 is 0.238%. These are controlled errors within the quantization bounds. The lossy tradeoff sweep makes this explicit:</p> tradeoff</div> Figure 4</div> </div> </div> Lossy tradeoff. Higher q preserves more precision at modest size cost. 
q=16384 gives p95 error of 0.055% at only 0.64 MiB more than q=4096.</p> </figcaption> </div> </figure> q</th> Size</th> P95 rel intensity error</th> P99 rel intensity error</th></tr></thead> ---:</td> ---:</td> ---:</td> ---:</td></tr> 256</td> 11.90 MiB</td> 3.499%</td> 3.813%</td></tr> 1024</td> 12.54 MiB</td> 0.874%</td> 0.950%</td></tr> 4096</td> 13.17 MiB</td> 0.218%</td> 0.238%</td></tr> 16384</td> 13.81 MiB</td> 0.055%</td> 0.059%</td></tr> </tbody></table> At q=16384 the p95 error is 0.055%. The file is 0.64 MiB larger than q=4096. For archival storage, that cost is near zero.</p> What I did not measure</h2> Search impact. This is the measurement that matters most. Do peptide identifications and FDR estimates change after a round-trip through lossy mzv1? The benchmark tracks numeric fidelity. It does not track downstream biological conclusions. That requires running a search engine on the original and round-tripped spectra and comparing the results. I have not done that yet.</p> Assumptions</h2> Several assumptions are baked into v0.0.1 that may turn out to be wrong:</p> One dataset is representative.</strong> 15HCD_1 is DDA on a Thermo instrument. DIA data looks different. timsTOF data looks different. Profile-mode data is an order of magnitude larger. Every conclusion in this post is conditional on one file.</li> The dump is a fair baseline.</strong> Stripping XML overhead is an obvious first step. It is not a format. The dump baseline shows how much of the size reduction is just removing interchange overhead vs actual compression. The gap between mzML (75.55 MiB) and the dump (30.78 MiB) is the XML tax. The gap between the dump and mzv1 lossless (30.78 to 27.89 MiB) is the codec doing real work. That gap is 2.89 MiB. It is real. It is small.</li> Scalar FOR is enough.</strong> The current size is dominated by the exact m/z stream. Entropy coding (rANS, tANS) should shrink that stream significantly. 
Cross-spectrum delta should help on DIA where consecutive spectra share precursors. Both are unimplemented. If they do not close the gap to mzMLb, the thesis is in trouble.</li> Python is an acceptable dependency.</strong> The prototype ingests mzML through pyteomics. This is fine for benchmarking. It is not fine for production. A native Zig mzML reader is on the roadmap. It is not in v0.0.1.</li> Decode speed matters more than encode.</strong> This is the asymmetric design bet. Encode happens once per file. Decode happens every time a search engine reads the data. If decode is not fast enough to feed a search engine without becoming the bottleneck, the format is not useful.</li> </ul> Limitations</h2> This is v0.0.1. The list of things not yet done is longer than the list of things done:</p> One dataset. No DIA. No timsTOF. No multi-instrument validation.</li> No entropy coding. The m/z stream is delta-encoded and FOR-packed but not entropy-coded. That is the largest remaining compression opportunity.</li> No SIMD. The decode path is scalar. SIMD FOR unpack should roughly double decode throughput.</li> No cross-spectrum delta. Consecutive DIA spectra share precursors. Encoding differences between spectra rather than absolute values should reduce the m/z stream significantly.</li> No search impact measurement. Numeric fidelity is not the same as biological fidelity.</li> No native mzML reader. Python dependency for ingestion.</li> No comparison against MS-Numpress. It is the most natural baseline for array-level compression inside mzML and should be added to the benchmark.</li> </ul> What comes next</h2> The immediate next steps are narrow:</p> Fix the m/z stream. Entropy coding first. rANS or tANS. If the m/z stream shrinks from 17.50 MiB to something closer to 5-7 MiB, the lossless path beats mzMLb. That is the threshold that decides whether to keep going.</li> Add a second dataset. DIA on a different instrument. 
If the codec assumptions break on DIA data, that is better learned now than after months of optimization.</li> Measure search impact. Run MSFragger or DIA-NN on original and round-tripped spectra. If peptide IDs and FDR are unchanged at q=16384, the lossy path is viable. If they drift at any q, the quantization scheme needs revision.</li> </ol> After that, the project either has evidence to continue or evidence to stop.</p> Where this fits</h2> mzarc is the second Zig tool I have shipped ◆</span> The first was z-fasta</a>, a FASTA indexer that runs 9-17x faster than samtools. I wrote about it here</a>. </span> . z-fasta proved that a single static binary could beat established tools on a narrow, well-defined problem. mzarc is trying to prove something harder: that domain-specific encoding can beat general-purpose compression on a format that matters to my field.</p> The honest assessment after v0.0.1 is that the thesis is not yet proven. The lossless path trails mzMLb. The decode path is fast but scalar-only. The dataset coverage is one file. The search impact is unmeasured.</p> But the architecture is sound. The codec composes cleanly. The transform chain round-trips exactly. The benchmark pipeline is reproducible. The byte accounting points directly at the remaining gap. These are good foundations.</p> The next version will either close the gap to mzMLb or explain why it cannot. Either outcome is useful.</p> Open source (MIT) at github.com/eneskemalergin/mzarc.</p> z-fasta: Indexing FASTA 17x Faster, and All the Things It Still Cannot Do 2026-02-28T00:00:00+00:00 📝</span> TLDR:</strong> z-fasta indexes FASTA files 9-17x faster than samtools while producing byte-identical .fai</code> output. It is a zero-dependency static binary written in Zig. It handles 20/20 edge cases correctly. It has a streaming mode that uses 4 MB of RAM.</p> </div> </div> Why this exists</h2> samtools faidx</code> is the standard. It works. It is correct. 
It is also slow.</p> On a 3 GB human genome, samtools faidx</code> takes 9.2 seconds on warm cache. Run it once and you do not notice. Run it in a pipeline that indexes hundreds of files and you wait.</p> I had been learning Zig for a few months. The language's strengths (no hidden control flow, explicit memory, direct SIMD) map directly onto the problem. A FASTA indexer is not complex. Scan bytes for lines starting with ></code>, record offsets, write them out. The bottleneck is how fast you move bytes from disk to CPU.</p> I wanted to see how close to the hardware limit I could get.</p> What z-fasta does</h2> z-fasta is a drop-in replacement for samtools faidx</code>. It emits byte-identical .fai</code> output. It also writes .zfi</code>, a compact binary index for programmatic use.</p> Bash</div>
zig build -Doptimize=ReleaseFast
# Emit samtools-compatible .fai to stdout
z-fasta index --emit-fai genome.fa > genome.fai
# Or create a binary .zfi index
z-fasta index genome.fa</code></pre> CS1: Build and run.</div> </div> Three modes:</p> Default:</strong> mmap + SIMD scanning, with duplicate header detection.</li> No-dedup:</strong> mmap + SIMD, no duplicate tracking. Fastest. Use when you trust your input.</li> Low-memory:</strong> chunked reader, 4 MB buffer. For constrained machines where mmap is not available.</li> </ul> Performance</h2> Benchmarked against samtools, seqkit (Go), and fastahack (C++) on three real datasets. All tests on an AMD Ryzen 9 3950X with warm cache, using hyperfine.</p> benchmark</div> Figure 1</div> </div> </div> Indexing time in seconds across genome (3.0 GB), proteome (66 MB), and transcriptome (972 MB) datasets. 
Lower is better.</p> </figcaption> </div> </figure> Dataset</th> Size</th> z-fasta (no-dedup)</th> samtools</th> seqkit</th> fastahack</th> Speedup vs samtools</th></tr></thead> :---</td> ---:</td> ---:</td> ---:</td> ---:</td> ---:</td> ---:</td></tr> Genome</td> 3.0 GB</td> 0.57s</td> 9.15s</td> 5.42s</td> 21.71s</td> 16.1x</strong></td></tr> Transcriptome</td> 972 MB</td> 0.10s</td> 1.79s</td> 1.76s</td> 5.51s</td> 17.5x</strong></td></tr> Proteome</td> 66 MB</td> 0.006s</td> 0.05s</td> 0.11s</td> 0.25s</td> 9.4x</strong></td></tr> </tbody></table> z-fasta is faster than every tool on every dataset. The gap widens with file size.</p> scaling</div> Figure 2</div> </div> </div> Indexing time vs file size. z-fasta and samtools both scale linearly. z-fasta’s slope is an order of magnitude shallower.</p> </figcaption> </div> </figure> z-fasta and samtools both scale linearly with file size. z-fasta's slope is an order of magnitude shallower. At 1 GB, samtools takes 3 seconds. z-fasta takes 0.2 seconds.</p> Sequence count has almost no effect. At 100,000 sequences, z-fasta takes 0.02 seconds in no-dedup mode. The work is I/O-bound, not header-bound.</p> How it works</h2> SIMD newline scanning</h3> A FASTA indexer spends almost all of its time looking for \n</code>. z-fasta uses Zig's @Vector</code> types to scan 32 bytes at a time. On x86_64 this compiles to AVX2 vector compares. A 3 GB genome is one pass at memory bandwidth.</p> mmap by default</h3> z-fasta memory-maps the entire file. The OS handles buffering. The CPU sees a flat byte array. No read()</code> calls in userspace. No buffer management.</p> The tradeoff is that time</code> reports VmRSS equal to the file size. The OS maps the file to virtual memory and time</code> counts it ◆</span> The working set during indexing is a fraction of what VmRSS reports. The OS does not actually read the whole file into RAM. </span> . 
Actual private heap allocation is small: roughly 45 MB for the header hash map in default mode, under 1 MB in no-dedup mode.</p> If you cannot afford even the virtual memory footprint, --low-mem</code> switches to a chunked reader with a 4 MB buffer. It is 3-4x slower than mmap but uses essentially no memory.</p> Mode</th> Time (Genome)</th> Heap</th> RSS (reported)</th></tr></thead> :---</td> ---:</td> ---:</td> ---:</td></tr> no-dedup</td> 0.57s</td> < 1 MB</td> ~3 GB (mmap)</td></tr> default</td> 0.57s</td> ~45 MB</td> ~3 GB (mmap)</td></tr> low-mem</td> 2.44s</td> 4 MB</td> 4 MB</td></tr> samtools</td> 9.15s</td> ~3 MB</td> ~3 MB</td></tr> </tbody></table> Correctness</h2> Speed means nothing if the output is wrong. I tested z-fasta against samtools on 20 edge cases: zero-byte files, missing trailing newlines, mixed \r\n</code> endings, unicode headers, binary garbage mid-file, tab characters in sequence names, sequences with no line wrapping.</p> correctness</div> Figure 3</div> </div> </div> Edge case heatmap. Green = pass, red = fail. z-fasta matches samtools on all 20 cases, including exit codes for errors.</p> </figcaption> </div> </figure> Result: 20/20 edge cases match samtools behavior exactly</strong>, including exit codes for error cases. seqkit silently accepts some malformed inputs that both samtools and z-fasta reject.</p> Honest Limitations</h2> This is a proof of concept. It indexes FASTA and nothing else.</p> It was built with Zig 0.14.0 ◆</span> The language is still pre-1.0 and changes between releases can break builds. Migrating to a newer version means updating the build.zig and adapting to any stdlib changes. The core SIMD and mmap logic is portable, but the build configuration and CLI parsing are tied to the version used here. </span> . The language moves fast and I have not migrated to a newer version yet. The build will break if you try with a different Zig release.</p> No gzip support. No FASTQ support. No BED. 
No sub-sequence extraction. The benchmarks show impressive numbers because the tool does one narrow thing and does not worry about the rest. That is honest but it is also limited.</p> The Rust ecosystem has several FASTA indexing libraries. rust-htslib</code> wraps htslib and provides FASTA indexing through it. needletail</code> is a streaming FASTA/Q parser with speed claims. Both are API libraries, not CLI tools. I chose not to deal with Cargo build complexity and Rust's CLI tooling ecosystem for what I wanted as a learning project. That is a personal constraint, not a technical judgment.</p> What is next</h2> The current v0.1.0 only indexes. The repository ◆</span> github.com/eneskemalergin/z-fasta</a>. The README has a more detailed roadmap. </span> already has a roadmap for what comes after:</p> z-fasta get</code></strong> - O(1) sub-sequence extraction by name or region. The other half of samtools faidx</code>.</li> z-fasta bed</code></strong> - Extract sequences for every entry in a BED file in a single pass.</li> z-fasta digest</code></strong> - In-silico trypsin digestion. If the scanner already moves through FASTA at memory bandwidth, computing peptide masses during the scan is a natural extension.</li> Gzip support</strong> - Requires a decompression library. I have not committed to the complexity yet.</li> </ul> These exist as plans, not code. The tool is fast at indexing. It needs to be useful at more than that before it replaces anything in a real pipeline.</p> Where I think this can go</h2> z-fasta is a small tool that does one thing correctly and fast. It is also the first step in a larger idea: a suite of high-performance bioinformatics utilities in Zig. The opinion pieces I have been writing argue that Zig fits the boring foundation layer: parsers, indexers, validators. z-fasta is the first proof that the performance argument holds.</p> It is not a replacement for samtools. Not yet. Maybe not ever. 
It is a demonstration that a small, focused tool in a systems language can beat a mature, general-purpose tool on its own turf. Whether that matters depends on whether the rest of the functionality gets built.</p> Open source (MIT) at github.com/eneskemalergin/z-fasta.</p> The Storage Crisis Nobody Budgets For 2026-02-21T00:00:00+00:00 📝</span> TLDR:</strong> Proteomics data is growing faster than the storage budgets that are supposed to hold it. New instruments produce deeper coverage per run. Single-cell work multiplies the sample count. AI demand is driving up the cost of storage hardware. The format debate between XML and binary is a distraction. Total data volume is the real problem, and nobody is accountable for it.</p> </div> </div> I built a storage server for the lab six months ago. 86 TB of RAID capacity. An Orbitrap Astral and a timsTOF Ultra 2 feed into it. It is already 60% full.</p> There is cold storage for older runs. We also lost data there, so it is not quite a solution.</p> This is not an isolated story. A Reddit user processing Astral data put it bluntly: "You'll need to find a data storage solution because buying 10TB hard drives isn't sustainable." This is from someone who just bought a multi-million-dollar instrument. The storage problem was an afterthought.</p> This is not a dramatic story. It is math.</p> The numbers are not on our side</h2> Per-run depth is increasing fast. The Orbitrap Astral can identify over 8,000 protein groups from a single HeLa run and over 15,000 from a fractionated sample in under 5 hours ◆</span> Thermo's Orbitrap Astral datasheet</a> claims >8,000 protein groups from a 5.5-min HeLa run. Confirmed independently by Nature Communications (2024)</a> mapping ~30,000 phosphosites in 30 minutes. The "eight proteomes per day" figure cites Jesper Olsen's group at the Copenhagen CPR. </span> . The timsTOF Ultra 2 is competitive on coverage.</p> More spectra per run means larger files. 
An Astral DIA file is around 15 GB ◆</span> Confirmed by multiple user reports. A DIA-NN GitHub discussion (#973, March 2024)</a> documents an Exploris 480 DIA file at 2-3 GB versus an Astral DIA file at ~15 GB. The jump is roughly 5-7x per file between generations. </span> . The previous generation (Exploris 480) produced files in the 2-3 GB range. Same experiment. Five times the storage.</p> Bash</div>
# One Astral run
15 GB experiment_01.raw
# 100 runs per month
1500 GB monthly_raw
# With converted mzML (uncompressed, ~10x expansion)
15000 GB monthly_mzML
# After a year (uncompressed mzML)
180 TB annual_mzML</code></pre> CS1: How 15 GB per run becomes a problem nobody planned for.</div> </div> Sample counts are exploding. A single-cell proteomics experiment can produce thousands of individual measurements across many acquisitions. Terabytes from one study. The field is moving from dozens of samples to hundreds, to thousands. The storage footprint tracks every run linearly.</p> Multi-omics compounds the problem. Genomics has its own storage crisis. When the same study collects proteomics, transcriptomics, and metabolomics, the storage demand multiplies across modalities.</p> The result: a single large study can produce 50 terabytes of raw data. Most of it will never be looked at again after the paper is published. All of it has to be stored somewhere.</p> The PRIDE repository ◆</span> Perez-Riverol et al., NAR 53(D1), 2025</a>. PRIDE receives 534 new datasets per month. 47% of all datasets were submitted in the last three years. Growth is accelerating. </span> receives 534 new datasets per month. Globus was added as a transfer protocol because FTP could not handle the file sizes.</p> Less than 10% of PRIDE's public datasets are ever reanalyzed ◆</span> From the 2025 PRIDE update paper</a>: "Overall, the number of datasets mentioned as reanalyzed is <10% of the PRIDE public datasets." Measured by counting dataset accession mentions in EuropePMC. 
</span> .</p> The conversion penalty</h2> Converting vendor formats to open formats is the right thing to do. The default conversion settings also make your storage problem measurably worse.</p> The mzML format is verbose by design. The MS-Numpress paper ◆</span> Teleman et al., MCP (2019)</a>. A naive mzML representation can be 4-fold to 18-fold larger than the vendor original. The paper also developed the MS-Numpress compression schemes that fix this. </span> documented this: a naive mzML conversion grows the file by 4x to 18x compared to the vendor original.</p> That expansion is not uniform. It depends on the vendor format. Thermo .raw files are compact binary containers. Converting them to uncompressed mzML creates the largest expansion. Bruker timsTOF .d files are already a directory of binary files (TDF/TSF). The expansion from Bruker .d to mzML is less dramatic, and many tools ◆</span> FragPipe's docs</a> explicitly recommend against converting .d: "we recommend using the raw .d format for Bruker data." DIA-NN</a> and MSFragger/IonQuant</a> all read .d natively. Thermo .raw users do not have this option. </span> can read .d natively anyway. The problem is most acute for Thermo users, who still make up the majority of the installed base.</p> Bash</div>
# Default conversion (no compression)
wine msconvert experiment.raw --mzML
# Result: 150 GB mzML from a 15 GB raw

# With MS-Numpress + zlib
# (linear compression for m/z, short logged float for intensities)
wine msconvert experiment.raw --mzML \
  --zlib --numpressLinear \
  --numpressSlof
# Result: ~20 GB mzML, comparable to original raw</code></pre> CS2: The same data, two conversion paths.</div> </div> This is not an argument against open formats. It is an argument that open formats need to be compact by default. mzMLb ◆</span> Bhamber et al., JPR (2021)</a>. HDF5-based format storing spectra as compressed datasets with XML metadata. Achieves file sizes comparable to vendor formats. Published, standardized, included in ProteoWizard. Rarely used. 
</span> solves the compression problem while keeping metadata accessible. MS-Numpress reduces mzML size by roughly 61% alone, up to 87% with zlib, and improves read speed by 21% in some configurations.</p> The tools exist. They are not the default.</p> The AI tax</h2> Storage has historically gotten cheaper. That long trend is not guaranteed to continue.</p> AI demand is disrupting the hardware supply chain in ways that hit labs buying storage right now. NAND flash prices increased by roughly 246% during 2025 according to Kingston's end-of-year report ◆</span> Kingston's Cameron Crandall reported NAND wafer pricing up 246% from Q1 2025</a>. Forbes (January 2026)</a> confirmed some NAND prices more than doubled in under six months. TrendForce projects</a> NAND prices rising another 33-38% QoQ in Q1 2026. This is structural, not cyclical. </span> . SSDs that were $175 are now $379. 1TB drives that were $40-50 are more than double.</p> Cloud providers buy drives by the exabyte. GPU manufacturers allocate supply to the AI market first. The downstream effect: the same components proteomics labs depend on cost more than they did two years ago.</p> A qualified objection: most proteomics archives live on HDD arrays or tape, not high-performance SSDs. NAND prices affect the active storage layer (SSD caching, high-speed analysis nodes) more directly than cold archives. A lab storing everything on spinning disk is partially insulated from NAND volatility.</p> The broader point stands. Storage infrastructure of all types is getting more expensive. HDD prices are also rising as manufacturers shift factory capacity to meet AI demand. Cloud pricing is complex. Egress and retrieval fees often dwarf storage costs ◆</span> S3 Glacier Deep Archive is ~$0.00099/GB/month</a>, but restoring large datasets costs $0.02-0.03/GB in retrieval fees plus hours of wait time. A 10 TB restore costs ~$200 before egress. As LeanOps puts it: "$1/TB to store, $20K to retrieve" a petabyte. </span> . 
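To make the sidenote's arithmetic concrete, here is a minimal sketch that compares a year of archive storage against a single restore. The prices are the approximate figures quoted above, not live AWS pricing, and the function names are hypothetical helpers, not part of any SDK.

```python
# Illustrative only: prices are the approximate figures quoted in the text,
# and these helper names are hypothetical, not from any AWS library.
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # USD per GB per month (storage)
RETRIEVAL_PER_GB = 0.02              # USD per GB, low end of the quoted $0.02-0.03


def storage_cost(gb: float, months: int) -> float:
    """Cost of keeping `gb` gigabytes archived for `months` months."""
    return gb * DEEP_ARCHIVE_PER_GB_MONTH * months


def retrieval_cost(gb: float) -> float:
    """Cost of restoring `gb` gigabytes once, before any egress fees."""
    return gb * RETRIEVAL_PER_GB


ten_tb = 10_000  # GB, using decimal units
print(f"Store 10 TB for a year: ${storage_cost(ten_tb, 12):.2f}")
print(f"Restore 10 TB once:     ${retrieval_cost(ten_tb):.2f}")
```

Under these assumed rates, one full restore costs more than a year and a half of keeping the same data archived, and that is before egress.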
The headline storage rate understates the true cost of keeping data accessible.</p> A lab buying a storage server in February 2026 is paying more for less capacity than they would have in 2024. That is not speculation. It is the NAND spot price.</p> Nobody is accountable</h2> Instrument vendors sell instruments. They do not pay for the storage that holds the data their instruments produce. Software vendors sell analysis tools. They do not pay for storage of intermediate and output files. Grant budgets include line items for instruments and compute. They rarely include realistic line items for long-term data retention.</p> How the NIH policy handles storage costs</summary> The NIH Data Management and Sharing Policy (effective January 2023) requires a DMS plan for all grant applications. Storage costs can be budgeted during the project period. After the grant ends, the data must persist. The funding does not.</p> Some institutions provide repository funding, core facility support, or infrastructure grants. The gap is not absolute. It is structural and widespread enough that most labs feel it.</p> </div> </details> To put numbers on the silence: the UAB Targeted Metabolomics and Proteomics Laboratory estimated that their TripleTOF 5600 generated 1-2 TB of raw data per month ◆</span> UAB TMPL data storage page</a>. At their quoted price ($0.15/GB/month), projected cost was $80,000/year. At modern S3 pricing (~$0.023/GB/month), raw storage drops to ~$550/year. The real costs are elsewhere: backup, replication, metadata, retrieval, and the sysadmin time to maintain it all. </span> . Their projected cost was $80,000 per year. At modern cloud rates, the raw storage is cheap. The real cost is everything around it.</p> The hidden expense is operations. Sysadmin time. Backup validation. Data migrations across storage generations. Security compliance. These costs scale with data volume and easily exceed hardware. 
A facility generating 50 TB/year might spend more on the person managing it than on the disks holding it.</p> The incentive structure reinforces the gap. Nobody gets a paper for efficient storage. The MS-Numpress authors published their work and the tools are available in ProteoWizard. The default conversion settings most researchers use do not enable compression. Journals require data deposition but do not fund the infrastructure. A handful of datasets get most of the download traffic. The rest sit.</p> The field optimizes for generation. Not retention.</p> </blockquote> The problem nobody wants to talk about</h2> Deletion is politically harder than storage.</p> Every dataset has an owner. Nobody wants to approve deletion. Nobody wants responsibility if the data becomes useful later. The result is a hoarding equilibrium: keep everything because the cost of deleting the wrong thing is higher than the cost of keeping it.</p> This is exactly why storage crises emerge gradually. No single decision creates them. They are the accumulated weight of decisions deferred. The server fills up not because someone chose poorly, but because no one chose at all.</p> Retention policies exist on paper. Enforcement is rare. The field has no culture of intentional deletion. We keep everything until the server forces a decision.</p> What would actually help</h2> The format debate between XML and binary is a distraction. Total data volume is the real problem, and it grows regardless of encoding choice. A 50-terabyte study is large in any format.</p> Compression tooling is the most direct lever. MS-Numpress ◆</span> Teleman et al., MCP (2019)</a>. Reduces mzML by ~61% alone, up to 87% with zlib. Also improves read speed by 21%. Ships with ProteoWizard, enabled by a single flag. </span> already works. mzMLb matches vendor format sizes. StackZDPD ◆</span> StackZDPD, Nature Scientific Reports (2022)</a>. Alternative encoding using difference encoding + zstd. 
Reduces mzML volume by ~80% with faster decompression than zlib. </span> offers similar ratios. The gap is not invention. It is adoption. Making compressed, indexed formats the default rather than a niche option would reduce storage pressure across the entire field.</p> Will Kryder's Law save us? Probably not this time. Storage density improvements have slowed. The transition from PMR to HAMR has been slow. Meanwhile, instrument throughput is accelerating faster than density improvements. Retention obligations never expire. Raw files are rarely deleted. Reprocessing requirements preserve originals. The old pattern of "storage will catch up" is not keeping pace with how fast this field generates data.</p> Retention policies need to become explicit. Not every file lives forever. Raw instrument data that has been processed and verified could move to cold storage after a defined window. Search engine intermediates can be regenerated.</p> Bash</div>
# S3 lifecycle rule: raw to cold to delete
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "proteomics-retention",
      "Filter": {"Prefix": ""},
      "Status": "Enabled",
      "Transitions": [
        {"Days": 90, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 1825}
    }
  ]
}
EOF

# Apply it (the bucket name is a placeholder)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-proteomics-bucket \
  --lifecycle-configuration file://lifecycle.json</code></pre> CS3: A lifecycle policy for proteomics data.</div> </div> Cloud storage tiers make this practical. AWS S3 Glacier Deep Archive costs roughly $0.001/GB/month, compared to $0.023/GB/month for Standard. That is a roughly 23x difference for data accessed once a year or less. The field has no standard for retention. Every lab reinvents the policy, or more often, has no policy at all.</p> Where I land</h2> Storage is not a glamorous problem. It is a maintenance problem. The kind that gets ignored until the server is full and someone has to spend a day deciding what to delete.</p> The cost is real and growing. Instruments produce more data per run. Experiments include more runs. AI demand pushes hardware prices up. 
Open formats are necessary but inflate storage requirements when used without compression.</p> The solutions exist. Compression works. Tiered storage policies work. What is missing is the incentive to adopt them, the tooling to make them easy, and the willingness to treat storage as a first-class budget item rather than an afterthought.</p> This problem will get worse before it gets better. There are opportunities to build tools that make it better. Small, fast, format-aware compression. Indexed access without full decompression. Retention automation that does not require a human to decide what to delete.</p> Some of those ideas feel worth exploring further.</p> ZagPlot: An Experiment in Learning Zig Through Plotting 2026-02-01T00:00:00+00:00 💡</span> What this is (and is not).</strong> ZagPlot is a private Zig project for learning how plotting libraries work. It is early, messy, and may never be a finished tool. That is fine. The point is what I learn along the way.</p> </div> </div> I wanted to understand what happens between "I have some numbers" and "I have a figure." That can sound simple. Well, it apparently is not.</p> Even a small plot has more parts than you notice until you have to build them yourself: scales, axes, ticks, labels, margins, colors, data parsing, output formats, and sensible defaults. Each one is a design decision you normally inherit from a library. I wanted to see those decisions surface ◆</span> The exercise is similar to writing a parser to understand parsing, or implementing a hash table to understand hash tables. You do not do it because the world needs another hash table. You do it because the implementation teaches you things the interface hides. </span> .</p> I am not building this because the world needs another plotting library. There are good ones already, across multiple languages. Zig alone has several plotting projects, including ones that draw in the terminal, which is genuinely impressive. 
ZagPlot is not competing with any of them. It would not make sense to try.</p> I am building it because I want to learn a few things that keep coming up in my work.</p> How do you map data into a visual space? How do you design an API that is flexible enough for real use but simple enough that someone else can read it? How do you keep a library path separate from a CLI path? How do allocators flow through a real system instead of a toy example?</p> </blockquote> Those questions are the point. The library is the excuse.</p> I chose SVG for the output format because it keeps the scope contained. SVG is text. You write it, inspect it, diff it, and open it in a browser. You do not need PNG encoders, font renderers, canvas APIs, or GUI frameworks before you can see whether your axis labels line up. The fewer dependencies I pull in, the more I learn about the actual problem I am trying to understand.</p> For now, ZagPlot lives on GitHub as a private repository. I expect it to stay that way for a while. The API will change. Names will get renamed. Parts will get deleted. I want room to make mistakes without pretending the project is ready for anyone else.</p> The rough shape of what I am aiming for</summary> This is not implemented yet. This is the direction I am thinking about.</p> const zag = @import("zagplot"); var plot = zag.Scatter.init(allocator); try plot.load_csv("data.csv"); try plot.render(stdout); </code></pre> Simple, explicit about allocation, works from CLI or as a library. Whether the final API looks anything like this is something I expect to learn by getting it wrong first.</p> </div> </details> If the library becomes genuinely useful for my own work later, I will expand it. More chart types. Better CSV handling. Cleaner defaults. Proper tests. If it does not, teaching me how plotting libraries think is still a good outcome for a project of this size.</p> That is enough. 
A small thing to learn from, not a thing to ship.</p> 📝</span> The honest version.</strong> I do not know whether ZagPlot becomes something I use, something I throw away, or something I learn from and leave behind. All three are fine outcomes for a project whose main goal is understanding.</p> </div> </div> mzBridge: An Early Attempt to Go from Vendor to Open 2026-01-09T00:00:00+00:00 Mass spectrometry data often starts in the least convenient place possible: inside a vendor format. Before I can think about models, statistics, compression, search engines, or nice downstream tooling, I first need to ask a boring question. Can I read the file without dragging half a runtime, a vendor DLL, or a fragile conversion chain behind me? ◆</span> That question bothers me more than it probably should, but for someone who cares about performance and optimization it makes sense. </span> </p> 📝</span> The mzML reality:</strong> mzML is the format I want to see at the end of the conversion step. It is documented, supported by many tools, and much easier to move between workflows than vendor raw files. The problem is that mzML is often not where the data starts.</p> </div> </div> If the original files are Thermo .raw</code>, Bruker .d</code>, or some other vendor format, then the first step is still conversion. That step can take a long time when cohort size grows. It can also come with annoying practical constraints: operating system assumptions, vendor libraries, large runtimes, and tools that are useful but not as simple as they should be.</p> This is not me saying those tools are bad. ThermoRawFileParser, ProteoWizard, and mzdata-converter exist because people needed a way through the mess. I have used them. They are part of the reason this ecosystem works at all.</p> Reading the file should not feel like the fragile part.</p> </blockquote> For Thermo files, the common path depends on the vendor API or tools built around it. 
For Bruker .d</code>, things are more open in practice because parts of the format are organized around SQLite ◆</span> Bruker's .d</code> format uses SQLite databases for some metadata structures, which means you can inspect parts of it with standard SQL tools. This is more open than Thermo's binary format, but it is still not the same as having a small reader that treats the data as plain infrastructure. </span> . There is a gap between "we can convert this" and "this is boring enough to build on."</p> I want the boring version.</p> mzBridge is my early attempt to test that idea. The goal is not to build a search engine, a complete replacement for every converter, or a grand universal mass spectrometry platform. The goal is smaller and more annoying: read vendor data directly, turn it into open data, and make the path small enough that it can live inside real workflows.</p> 💡</span> What I envision mzBridge to be.</strong> A small native tool that reads vendor mass spectrometry formats and writes open data. Not a search engine. Not a universal converter. A bridge between the format you have and the format you need.</p> </div> </div> Zig feels interesting for this because the problem is close to the metal. This is binary parsing, file offsets, buffers, compression, validation, and memory layout. It is not the kind of problem where I want a garbage collector making decisions. It is also not the kind of problem where I want to write C and manually hold every sharp edge with my bare hands. I want a small, simple binary that works on any platform.</p> Zig gives me a middle place that I enjoy.</p> Explicit allocation. Small binaries. Easy cross-compilation. Good control over structs and bytes. Enough safety checks that I do not feel like every mistake becomes silent memory corruption. Enough directness that I can still see what the program is doing. That is the appeal.</p> I do not think Zig magically makes this easy. The hard part is not syntax. 
The hard part is that vendor formats are not designed for independent readers. Some parts can be inferred. Some parts can be validated. Some parts will probably be weird because instrument models, firmware versions, acquisition methods, and software versions all leave their fingerprints in the file.</p> ⚠️</span> The legal constraint.</strong> I cannot reverse engineer this from vendor source code. I do not want to touch anything that makes the legal or ethical situation messy. The only version of this project that makes sense is a clean one: public files, observed behavior, independent parsing, documented assumptions, and validation against outputs produced by accepted tools. I may not make this public for a while. I may not put it on GitHub until I understand the risks better. I do not want a DMCA problem over a tool whose purpose is to make scientific data easier to access.</p> </div> </div> The practical suspicion is simple. A lot of conversion feels slower and heavier than it needs to be because the path is too layered. Vendor API, managed runtime, wrapper, converter, XML writer, then maybe another tool that reads the XML back in. Each layer makes sense historically. Together they make the first step of analysis feel more expensive than it should.</p> I want to know how much of that cost is necessary.</p> Maybe the answer is most of it. Maybe direct parsing runs into too many edge cases. Maybe the format differences across versions make the maintenance burden too high. Maybe the safe public version of this project ends up much smaller than the private experiment. That would still teach me something.</p> Downstream tools inherit the shape of the input step. If the first step is slow, fragile, platform-specific, or legally awkward, then everything after it starts with that friction. Search engines, compression formats, public repositories, automated pipelines, and reproducible analysis all depend on reading the data first. 
That should be the least dramatic part of the workflow.</p> </blockquote> I do not know yet whether mzBridge becomes a public tool, a private experiment, or just a set of lessons for future projects like mzArc and mzValidate. I know the question is worth testing. Vendor data is where a lot of proteomics begins, and pretending that open science starts only after conversion feels incomplete.</p> So this is the early note to myself. Try the bridge. Keep it clean. Validate everything. Do not overpromise.</p> The Landscape of Proteomics Search Engines, and Does a Zig-Based One Make Sense? 2026-01-02T00:00:00+00:00 📝</span> TLDR:</strong> The proteomics search engine space is crowded and genuinely competitive. DIA-NN, Spectronaut, FragPipe, MaxQuant, and Sage each occupy a real niche. Sage proved a systems language can win on throughput, but features lag behind the decade-old tools. Pipeline projects like quantms and nf-core are already solving the integration problem by wrapping engines in reproducible workflows. A Zig-based search engine is not the right next step. A Zig-based workflow engine built around composable libraries might be.</p> </div> </div> I kept asking myself whether a Zig-based proteomics search engine makes sense. The answer is probably no. The space is crowded, the incumbents are improving fast, and the hard problems are not about search speed. Why that is the case is more interesting than a simple yes or no.</p> The landscape as of early 2026</h2> DDA is not dead, but DIA is where the field is heading. The engines that matter most right now are the ones handling DIA data well.</p> On the open-source side, DIA-NN ◆</span> DIA-NN is built primarily by Vadim Demichev. Supports GPU and CPU, library-based and library-free modes, reads most vendor formats. Free for academic and commercial use. GitHub</a> </span> reset expectations for DIA processing. MaxQuant with MaxDIA handles DIA within the MaxQuant ecosystem. 
FragPipe ◆</span> FragPipe wraps MSFragger (DDA and DIA search) with DIA-NN (quantification), MSBooster, Percolator, and ProteinProphet in one pipeline. fragpipe.nesvilab.org</a> </span> combines MSFragger and DIA-NN in a single platform. Skyline remains the targeted quantification workhorse.</p> On the commercial side, Spectronaut ◆</span> Biognosys. Polished UI, library-based and directDIA modes, subscription licensing. Pushes users toward the proprietary HTRMS format. biognosys.com</a> </span> has the best UI and turnkey analysis. Proteome Discoverer is Thermo's bundled solution, Windows-bound and slower than alternatives. Mascot still exists but feels like legacy infrastructure.</p> On DDA, MSFragger ◆</span> MSFragger uses fragment ion indexing for fast database search. Open-source, Nesvizhskii lab. Published in Nature Methods</em> (2017). Integrated into FragPipe. </span> changed the speed equation years ago. Sage ◆</span> Sage by Michael Lazear. Rust-based, MIT-licensed. Benchmarks faster than MSFragger on most DDA benchmarks. LFQ and TMT quantification, RT prediction, FDR control. Published in JPR</em> (2023). GitHub</a> </span> is a newer Rust-based engine that benchmarks faster still. One developer, limited development depth. That is not a criticism. It is the reality of a project that started as a personal learning exercise.</p> The space is crowded, and that is a good thing</h2> This is worth saying plainly. The search engine landscape is not broken. It is competitive, fast-moving, and full of genuine innovation. New tools appear regularly. Old tools improve. Benchmarking papers compare them ◆</span> For example, the 2023 Nature Communications</em> benchmarking of DIA-NN, Spectronaut, MaxDIA, and Skyline across Orbitrap and timsTOF data. More recently, comparisons of Spectronaut vs DIA-NN on lung adenocarcinoma biopsies (Yu & Siu, JPR</em> 2026). </span> . The community debates them. 
This is what a healthy software ecosystem looks like.</p> If you need to analyze proteomics data today, you have good options. Free ones, fast ones, well-documented ones. The problem is not a lack of tools. The problem is stitching them together.</p> DIA-NN and Spectronaut: two models, both working</h2> DIA-NN is remarkable. Built mostly by one person, now one of the most-used DIA tools in the field. Free for academic and commercial use. No license wall between institutions. Fast.</p> Spectronaut has a polished interface, excellent documentation, and dedicated support. It also has a subscription fee, proprietary HTRMS format lock-in, and academic/commercial license tiers. The results are good. The cost, the tiering, and the format lock-in are the parts that frustrate me. If a graduate student learns on Spectronaut and moves to a lab that cannot afford the license, their tool stack breaks.</p> Both models work. DIA-NN proved you do not need a company to build a widely adopted tool. Spectronaut proves commercial polish still commands a market. Neither is going away.</p> Speed is solved. Scale is not</h2> MSFragger solved raw search speed. Sage pushed it further. Search throughput is no longer what keeps people up at night.</p> The new bottleneck is cohort scale. At 20 samples, most tools work. At 500, the ones designed for workstations start creaking. At 1,000, the ones not built for headless environments become painful.</p> Bash</div>
## Run DIA-NN on 500 .raw files
diann --dir data/ \
  --lib spectral_library.speclib \
  --out report.tsv

## Convert FragPipe output for downstream analysis
python -c "
import pandas as pd
from glob import glob

# Assumed layout: one combined_protein.tsv per FragPipe run directory
combined = []
for f in glob('*/combined_protein.tsv'):
    combined.append(pd.read_csv(f, sep='\t'))
pd.concat(combined).to_csv('all_results.tsv', sep='\t', index=False)
"</code></pre> CS1: The gap between tools is where the real work lives.</div> </div> The problems are not CPU cycles. 
They are memory management across thousands of identifications, file I/O patterns that assume local disk when data lives on network storage, and quantification workflows that break silently when scaled.</p> Spectronaut handles large cohorts if you pay for the server license. DIA-NN handles them if configured correctly. FragPipe works but the Java GUI adds friction on headless servers. MaxQuant will run, but slowly.</p> Feature fragmentation: everyone owns a corner</h2> Nobody does everything well. Each tool has a signature strength.</p> MaxQuant owns label-free quantification. MaxLFQ is the algorithm papers cite without thinking. FragPipe bridges DDA and DIA with MSFragger speed. Spectronaut has the best UI and directDIA mode. DIA-NN has speed, openness, and format support. Skyline owns targeted quantification and method building. Sage has raw throughput and cloud-native design but lacks the quantification depth and PTM analysis of tools that have been evolving for a decade.</p> The niches are deep but narrow. If you need LFQ on a DIA dataset with PTM analysis and batch correction, you are stringing together multiple tools. That works. It is also where the friction lives.</p> The field rewards novelty. A new quantification method gets a paper. A new search algorithm gets a paper. Nobody gets a paper for making tools work together. The incentive structure produces fragmentation.</p> Sage, and what it taught me</h2> I followed Sage from its earliest days. It started as a blog post and a simple repository. Michael Lazear was learning Rust and proteomics at the same time. The project grew from a learning exercise into a JPR</em> publication and a tool people actually use in production ◆</span> Lazear MR. "Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale." J. Proteome Res.</em> 22(11):3652-3659, 2023. DOI</a>. MIT-licensed. GitHub</a> </span> .</p> Sage proved two things. 
First, a modern systems language can enter a mature space and win on raw performance. Second, the gap between a fast search engine and a full-featured platform is enormous. Sage is faster than most tools on clean benchmarks. It also has fewer features, less community testing, and narrower format support than the tools it competes with. ◆</span> That is not a criticism. It is the reality of a project that started as a personal learning exercise and has not yet had the time to evolve into a mature tool. The point is that the gap between a fast search engine and a complete analysis platform is measured in years of domain-specific development, not in CPU cycles. </span> </p> A fast search engine is not the same as a complete analysis platform. The distance between them is measured in years of domain-specific development, not in CPU cycles.</p> </blockquote> I watched that trajectory. It made me think about what I could learn by building something similar in Zig. The conclusion I arrived at is not the one I expected.</p> People are already solving the integration problem</h2> The gap between tools is real, but groups are working on it.</p> quantms ◆</span> Dai C, Pfeuffer J, Wang H et al. "quantms: a cloud-based pipeline for quantitative proteomics enables the reanalysis of public proteomics data." Nature Methods</em> 21:1603-1607, 2024. GitHub</a>. Wraps search engines into reproducible Nextflow pipelines with containers. quantmsdiann wraps DIA-NN for DIA. </span> wraps search engines and quantification tools into reproducible Nextflow pipelines with containerized environments. It supports DDA and DIA workflows and follows nf-core standards.</p> nf-core/proteomicslfq provides LFQ analysis using OpenMS and MSstats. The nf-core community now has dozens of pipelines and thousands of contributors, with a dedicated Mass Spectrometry Proteomics Special Interest Group.</p> These projects show the path forward. 
The integration problem is being solved by workflow engines wrapping existing tools in reproducible containers. You do not need to rebuild the search engine. You need to make the engines work together at scale.</p> A Zig-based engine? Probably not. Something adjacent</h2> Sage answered the question of whether a new language can produce a competitive search engine. It can. It also answered the question of whether speed alone wins. It does not.</p> The hard problems in search engines are quantification algorithms tuned for a decade, FDR control and protein inference that depend on deep domain knowledge, and PTM localization that is as much biochemistry as computation. Building a new engine means rebuilding all of that. The field does not need another fast searcher. It has several.</p> What I keep coming back to is the workflow engine idea from my earlier thinking ◆</span> See Workflow Engines and the Case for a Zig-Based One</a>. </span> . A Zig-based workflow engine, not a search engine. Something that treats search engines and quantification tools as composable libraries, OpenMS-style ◆</span> OpenMS is a C++ framework for mass spectrometry data analysis with Python bindings (pyOpenMS). Provides modular tools (TOPP) for building custom workflows. openms.de</a>. BSD-licensed. Nearly two decades of continuous development. </span> , but built from the start for the scale and deployment constraints of modern proteomics.</p> The value would not be in beating DIA-NN on identifications. It would be in making the glue between tools more reliable and less fragile.</p> This is a library project, not an engine project. It is also a multi-year effort with uncertain demand. I am not starting it tomorrow. But watching Sage grow from a blog post to a real tool makes the idea feel less abstract.</p> Where I land</h2> The search engine landscape is crowded. It is also healthy. Genuine competition, genuine open-source options, and genuine innovation. 
The problems are not a lack of good engines. The problems are feature fragmentation, scale challenges, and the integration work of stitching tools together.</p> Pipeline projects like quantms and nf-core are solving integration with Nextflow and containers. That is probably the right approach for production work. A Zig-based workflow engine that treats proteomics tools as composable building blocks might be a better fit for exploration, for learning, and for the kind of custom pipelines where Makefiles still work until they do not.</p> I will keep using DIA-NN for most things. I will keep watching Sage grow. I will keep wishing the tools talked to each other better.</p> And I will probably keep wondering what a modular proteomics toolkit in Zig would look like, even if I never build it.</p> Vendor-Locked MS Files and Open Formats, a Collision 2025-12-28T00:00:00+00:00 📝</span> TLDR:</strong> Instrument vendors use proprietary file formats. Thermo is the most locked down. Bruker ships an SDK. The problem is not that proprietary formats exist. It is that accessing them requires proprietary converters that gate everything downstream. Open formats are necessary, and they can be binary, compressed, and fast. mzML proved the model works. The next step is making open formats the default, not the conversion target.</p> </div> </div> I should be fair to start. Vendors make instruments. Instruments generate data. The data is theirs to format as they see fit. Nobody owes me a CSV.</p> The gap between how instrument data is stored and how it is used has become a bottleneck. Not in theory. In practice. 
Every proteomics pipeline I work with starts by converting files before any analysis can begin.</p> Bash</div>
## Convert Thermo .raw to mzML before anything else
## (-b names the output file; -o expects an output directory)
mono ThermoRawFileParser.exe \
  -i=PXD000001.raw \
  -b=PXD000001.mzML
## Now the real pipeline can start</code></pre>
CS1: The first step in every pipeline is a toll booth.</div> </div> That conversion step is the bottleneck I want to talk about.</p> The landscape, not the villain</h2> Vendors sit on a spectrum. Looking at that spectrum is more useful than singling out one company.</p> Model</th> Example</th> Can you read the data?</th></tr></thead> Proprietary format + proprietary reader only</td> Thermo .raw</td> Only through vendor's DLLs</td></tr> Proprietary format + documented SDK</td> Bruker .d</td> Through SDK, restricted license</td></tr> Proprietary format + public specification</td> Rare in MS</td> Anyone can implement a reader</td></tr> Open format + multiple independent readers</td> mzML, mzMLb</td> Fully open</td></tr> </tbody></table> The gap between the first two rows and the last two is where the ecosystem cost lives. Not in the format itself. In the access layer.</p> Thermo: the door is now cross-platform, but still locked</h2> Thermo Fisher Scientific is the largest manufacturer of mass spectrometers used in proteomics. Their .raw format is a binary blob readable only through proprietary Windows DLLs. For a long time, that meant Linux users needed a Windows VM or Mono to read their own data.</p> That has improved. Thermo now ships RawFileReader ◆</span> Thermo's RawFileReader is a group of .NET assemblies wrapping the ThermoFisher.CommonCore C# libraries. It officially supports Windows, Linux, and macOS through .NET. GitHub</a> </span> , a cross-platform .NET library. ThermoRawFileParser ◆</span> Hulstaert N et al. "ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion." J. Proteome Res.</em> 19(1):537-542, 2020.
GitHub</a> </span> builds on top of it and runs on Linux at scale through .NET Core. You no longer need a Windows VM.</p> The framing matters here. Linux access arrived late. The original RawFileReader required Windows, and the community spent years building workarounds before Thermo provided a cross-platform path. It is also still dependent on Thermo's proprietary stack. If Thermo changed their DLL interface tomorrow, every downstream converter would break. That is not a theoretical risk. It is a structural dependency.</p> The community keeps finding creative ways to work around the same door. There is a Rust reader that hosts the .NET runtime in-process to call RawFileReader ◆</span> thermorawfilereader.rs</a> embeds the .NET runtime inside a Rust process. Clever engineering that still depends on Thermo's DLLs. </span> . Another project, ThermoRawRead, provides a GUI and CLI for extracting spectra ◆</span> ThermoRawRead</a> by ctarn is a cross-platform tool built on RawFileReader with a pipeline processing model. </span> . All of them route through Thermo's reader.</p> A vendor engineer would say: "We are not protecting a file format. We are protecting correct interpretation of instrument data." New instruments introduce new detector modes, ion mobility dimensions, and acquisition schemes. If a third-party reader misinterprets the data, users blame the instrument. That concern is legitimate. It also does not require proprietary readers forever. A public specification with conformance tests would serve the same goal.</p> Bruker: easier to use, not truly open</h2> Bruker ships the TDF-SDK ◆</span> Bruker TDF-SDK provides C++ and Python bindings on Windows and Linux for reading .tdf and .tsf files. Bruker TDF-SDK page</a> </span> with documentation, examples, and cross-platform support. Their timsTOF stores data in SQLite and HDF5 containers. 
That is more accessible than Thermo's binary blob.</p> But accessible is not the same as open.</p> The TDF ecosystem still revolves around proprietary Bruker libraries (timsdata.dll</code> / libtimsdata.so</code>) in many tools. OpenTIMS ◆</span> OpenTIMS</a> parses portions of the .tdf format directly, including the SQLite components. It exists because people wanted access that was less dependent on Bruker's SDK. </span> emerged because the community wanted a path that did not require Bruker's SDK. pyTDFSDK ◆</span> pyTDFSDK</a> provides a Python wrapper around the TDF-SDK DLL. Still depends on the proprietary library. </span> wraps the SDK DLL.</p> Bruker protects intellectual property. TIMS is proprietary ion mobility technology that differentiates their instruments. I am not asking them to give that away. The difference from Thermo is real and worth crediting: Bruker decided that making data accessible to third-party tools is better for customers, and by extension better for business. But the dependency is still on a vendor-controlled SDK, not an open specification. That distinction matters.</p> Not just vendors: the software middleman problem</h2> Vendors are not the only offenders. Biognosys's Spectronaut uses a proprietary format called HTRMS ◆</span> HTRMS is a pre-processed binary format for Spectronaut. Biognosys recommends converting to it for timsTOF data. The converter is free but closed-source. The format specification is not public. </span> . The converter is free but closed-source. The format spec is not public.</p> I should be precise about the harm here. A proprietary internal format that speeds up a specific tool is an engineering choice. The problem is not that HTRMS exists. It is that the format specification is not public, which means the processed data exists in a form only one tool can read. 
If Spectronaut published the HTRMS layout and documented the encoding, the performance argument would remain and the lock-in would disappear.</p> DIA-NN ◆</span> Demichev V et al. "DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput." Nat. Methods</em> 17:41-44, 2020. GitHub</a> </span> demonstrates that a proprietary intermediate format is not necessary for performance. It processes Thermo .raw, Bruker .d, and Sciex .wiff files ◆</span> DIA-NN supports these formats directly from a user perspective. Some of this support may route through vendor SDKs internally. The point is not that DIA-NN is fully independent of vendor code. It is that the workflow does not require a separate conversion step and a proprietary intermediate layer. </span> without requiring users to manage a separate conversion step. Speed and openness are compatible.</p> The reverse-engineering graveyard</h2> The history of people trying to read Thermo files without Thermo's permission is long and mostly sad. Unfinnigan ◆</span> Unfinnigan</a> was a Google Code project for "painless extraction of mass spectra from Thermo raw files." The name is a jab at the Thermo Finnigan lineage. Archived. </span> was one of the early attempts. It tried to read raw spectra without a proprietary library. It died.</p> OpenChrom reads vendor formats natively through reverse-engineered binary readers. ProteoWizard's msconvert is the workhorse most pipelines depend on. None of these are small projects. All exist because vendors will not publish their formats.</p> The cost is not measured in lines of code. It is measured in abandoned projects, wasted grant cycles, and formats that change without warning.</p> I should be precise about the risk. Thermo has not, to my knowledge, deliberately broken downstream tools by changing their DLL. The risk is subtler. Instrument firmware evolves. New scan modes, new detectors, new ion optics. The format tracks the hardware. 
When a new instrument ships, the format changes, and every downstream converter chases the update. That is not malice. It is the natural consequence of a closed format that the community cannot maintain independently.</p> The legal uncertainty is itself a cost</h2> This is where the argument lands hardest for me.</p> I have an early-stage idea for reading Thermo .raw files natively. No RawFileReader. No .NET. No Windows. A single static binary that you copy to a cluster node and run.</p> I do not know if it is legal to try.</p> Reverse engineering for interoperability exists in a legal gray area that varies by jurisdiction ◆</span> In the US, reverse engineering for interoperability may be protected as fair use under certain conditions (Sony v. Connectix, Sega v. Accolade). The EU explicitly permits reverse engineering to achieve interoperability under the Software Directive (2009/24/EC). Canadian law is less clear. In all cases, the specifics of how the reverse engineering is done and what license agreements govern the software matter enormously. </span> . Thermo's RawFileReader license ◆</span> RawFileReader ships with a proprietary license document. The license terms around reverse engineering, decompilation, and competitive use are standard for vendor SDKs but deliberately restrictive. The exact boundaries are unclear without legal review, which itself costs money most researchers do not have. </span> restricts what you can do with their reader. Whether analyzing the format independently through clean-room reverse engineering is permitted depends on who you ask and where you are.</p> The uncertainty itself is a burden. A researcher who wants to build a better, faster, more open reader has to either accept legal risk or spend resources on legal review that could go to the actual work. 
The person most motivated to solve the problem is also the person with the least clarity on whether trying is allowed.</p> That is the structural failure in its purest form.</p> mzML is not the whole answer, and that is fine</h2> mzML is the HUPO-PSI standard ◆</span> The Proteomics Standards Initiative unified mzXML and mzData into mzML in 2008. It has been the community interchange format for nearly two decades. </span> . It is XML-based, verbose, and designed for interoperability over compactness. It solved the problem of having a common format for processed data.</p> I am not arguing against mzML. I am arguing that open formats are bigger than mzML.</p> mzMLb exists ◆</span> Bhamber RS et al. "mzMLb: A Future-Proof Raw Mass Spectrometry Data Format." J. Proteome Res.</em> 20(1):172-183, 2021. DOI</a>. Reference implementation in ProteoWizard. </span> . It compresses spectra into HDF5 datasets while keeping metadata as XML. File sizes comparable to vendor formats. Reference implementation in ProteoWizard. The path forward exists. It needs adoption, not invention.</p> An open format can be binary, compressed, and optimized for random access. What makes it open is not the serialization choice. It is whether you can read it without asking permission. The specification is public. A reference reader exists that does not depend on proprietary libraries. The format does not change in breaking ways without notice.</p> mzML satisfies these. mzMLb improves on the performance dimension. The fight is not about XML versus binary. It is about the conversion layer between the instrument and the analysis.</p> The middleman is the bottleneck</h2> The problem is not that vendors have proprietary formats. Every instrument vendor has one. The problem is that accessing the data requires proprietary libraries controlled entirely by a single company.</p> The dependency chain is real. RawFileReader depends on Thermo's DLLs. ThermoRawFileParser depends on RawFileReader. 
Every downstream pipeline depends on ThermoRawFileParser. A proprietary format with an open, maintained reader is workable. A proprietary format with no public specification and a proprietary reader gating all access is a structural failure.</p> The stronger argument is not about Thermo specifically. It is that the ecosystem should not depend on any single vendor's reader implementation. That criticism applies equally to Thermo, Bruker, Waters, and Sciex.</p> Where I land</h2> Open formats tend to become dominant when interoperability creates enough economic value. In proteomics, the value is clear: pipelines that cross labs, instruments, and software stacks without a conversion tax. The transition takes time, and the wasted effort accumulates in the meantime.</p> Thermo is the most visible obstacle because they have the largest installed base and the most locked-down access model. They feel like the old guard that has not noticed the world changed. Bruker shows you can sell instruments, protect IP, and make your data more accessible. Neither is fully open, but the gap between them shows the range of possible choices, and Thermo's position is a choice.</p> The software layer matters too. Spectronaut makes a speed argument for HTRMS. DIA-NN shows performance does not require a closed format. If your tool is fast and your format is open, you win on both axes. If your tool is fast and your format is closed, only one of those things ages well.</p> I will keep converting .raw files with ThermoRawFileParser like everyone else. It works. It solved the Linux problem, and I am grateful to the people who built it. But the next generation of proteomics tools should not start with a toll booth.</p> ProteoForge: What Happens When You Stop Averaging Your Peptides 2025-12-20T00:00:00+00:00 I spent the better part of my PhD staring at peptide intensity matrices. Thousands of rows, dozens of columns, lots of missing values. 
The standard workflow says: roll those peptides up into protein-level quantities, run your differential expression, make a volcano plot, write the paper ◆</span> If you have never seen a peptide intensity matrix, imagine a spreadsheet where most of the cells in columns 5 through 12 are empty, and the ones that are not empty disagree with each other about what the protein is doing. That is bottom-up proteomics. </span> . I did that. Multiple times. It always felt like we were throwing away information.</p> This post is about ProteoForge, the framework that came out of that frustration. It is not a summary of the paper (you can read the preprint</a> for that). It is about the problem as I experienced it, the decisions we made while building the tool, and what surprised us when we finally applied it to real data.</p> The averaging problem</h2> The human genome codes for roughly 20,000 proteins. But the actual number of distinct protein forms in a cell is orders of magnitude larger1</a></sup>. Alternative splicing, post-translational modifications, proteolytic processing: these create proteoforms2</a></sup>, and proteoforms from the same gene can have opposing biological functions.</p> 📝</span> The KRAS example.</strong> Different proteoforms of the KRAS oncogene have opposing effects on MAPK signalling[^3]. KRAS4A and KRAS4B are splice variants with different membrane targeting, different interactomes, and different downstream effects. A protein-level average across both would hide the biology entirely. KRAS is not an edge case. It is the rule at scale.</p> </div> </div> Bottom-up proteomics measures peptides, not intact proteins. The field has lived with this trade-off for years: you get broad coverage but you lose proteoform identity. 
The standard fix is to group peptides by protein accession, average their intensities, and call it a protein quantity.</p> That averaging step is where the information dies.</p> If three peptides from a protein go up under treatment and two go down, the protein-level average might show no change at all. A flat line on your volcano plot. You move on. You never see that the protein was actually doing something complicated at the proteoform level.</p> What already existed (and where it broke)</h2> Feature</th> PeCorA</th> COPF</th> ProteoForge</th></tr></thead> Discordant peptide detection</td> Yes</td> No</td> Yes</td></tr> Peptide grouping into proteoforms</td> No</td> Yes</td> Yes</td></tr> Missing data tolerance</td> Poor (degrades early)</td> Poor (needs complete data)</td> Good (stable to 60%)</td></tr> Statistical model</td> Per-peptide linear model</td> Peptide correlation matrix</td> RLM with interaction terms</td></tr> Minimum requirement</td> 2+ treatment groups</td> Many replicates, low missingness</td> 4+ peptides per protein</td></tr> </tbody></table> The core issue with both tools, and really with every method we tested: missing data ◆</span> A typical DIA dataset from an Orbitrap will have about 25% missing values at the precursor level. By the time you filter for valid peptides per protein, that number often exceeds 40%. Most statistical methods assume complete data and crash or bias badly when confronted with this. </span> . In a typical DIA proteomics experiment, 20 to 40 percent of peptide measurements are missing. Some are missing at random (instrument did not pick them up). Some are missing because the peptide is genuinely absent in that condition. The distinction matters enormously, and most tools either ignore it or require you to delete incomplete cases.</p> That was the gap.</p> The idea behind ProteoForge</h2> ProteoForge does four things, in order. 
I will walk through each one, not with equations (the paper has those), but with the reasoning.</p> Figure 1: The four-module ProteoForge pipeline. (A) Framework schematic. (B) Effect of normalization steps on peptide distributions. (C) Three example proteins showing peptide profiles, distance heatmaps, and final dPF assignments.</em></p> Module 1: Normalize relative to a control</h3> Raw peptide intensities are noisy and have different baselines across samples. We log-transform, z-score, and then subtract the mean control intensity for each peptide. After this step, every value represents the deviation of that peptide from its own control baseline. Sounds simple. It is. But it makes the downstream models much cleaner because you are comparing changes, not absolute values.</p> 💡</span> See Figure 1B.</strong> The control-adjusted values collapse the four condition-specific distributions into a shared zero-centered space. This is the input for everything that follows.</p> </div> </div> Module 2: Find discordant peptides</h3> This is the statistical core. For each protein, we fit a linear model that asks: does this peptide behave differently from its siblings across conditions?</p> R</div> Intensity ~ Condition * Peptide</code></pre> CS1: The interaction model used by ProteoForge for each protein.</div> </div> The interaction term (Condition * Peptide</code>) captures the discordance. If a peptide responds to the condition differently than the other peptides from the same protein, the interaction term will be significant.</p> We tested several model types. Ordinary least squares, weighted least squares with custom imputation-aware weights, generalized linear models, and robust linear models (RLM) with Huber M-estimation3</a></sup>.</p> Why we chose RLM over other models</summary> Supplementary Note 1 in the paper contains the full model comparison across OLS, WLS with custom weights, GLM, and RLM.
RLM with Huber weights gave the best balance between performance and ease of use. The custom-weighted WLS models can outperform RLM at extreme missingness (above 60%), but they require you to correctly specify the weight matrix, which is error-prone in practice.</p> At moderate missingness (under 50%), all models performed similarly. The differences emerged at the edges: high missingness, many imputed values, unbalanced group sizes. RLM handled all of these without requiring the user to configure anything beyond the default.</p> </div> </details> We settled on RLM as the default because of a property that turned out to be exactly what we needed: it automatically down-weights outliers. Imputed values, especially the down-shifted low values used for condition-complete missingness, look like outliers to the model. RLM treats them accordingly. No manual weight specification needed. The user does not have to think about which values were imputed and which were real.</p> The p-values from the interaction terms go through a two-step Benjamini-Hochberg FDR correction4</a></sup>: first within each protein, then globally ◆</span> The first correction accounts for multiple peptides per protein. The second accounts for multiple proteins genome-wide. A miscleaved peptide or technical artifact would have to survive both rounds, which is why we use this approach despite it being conservative. </span> . Peptides that pass the threshold (we used adjusted p < 0.001) are flagged as significantly discordant.</p> Module 3: Cluster the discordant peptides</h3> Finding discordant peptides is not enough. You need to know which discordant peptides move together, because a proteoform is defined by a group of co-varying peptides, not a single one.</p> We compute the Euclidean distance between median adjusted intensity profiles and cluster with Ward linkage5</a></sup>. 
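To make the clustering step concrete, here is a minimal pure-Python sketch of Ward-style agglomerative clustering on Euclidean distances. It is an illustration, not ProteoForge's code: the peptide names, profile values, and the fixed cluster count passed to the hypothetical `ward_clusters` helper are all made up for this example.

```python
from itertools import combinations

# Hypothetical control-adjusted median profiles
# (one list per peptide, one value per condition). Illustrative numbers only.
profiles = {
    "pep_A": [0.0, 0.1, -0.1, 0.0],    # flat
    "pep_B": [0.1, 0.0, 0.0, -0.1],    # flat
    "pep_C": [0.0, 1.9, 2.1, 2.0],     # rises with treatment
    "pep_D": [0.1, 2.0, 2.0, 2.2],     # rises with treatment
    "pep_E": [0.0, -1.5, -1.4, -1.6],  # drops, on its own
}


def centroid(rows):
    """Column-wise mean of a list of profiles."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))  # squared Euclidean
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(data, n_clusters):
    """Greedily merge the cheapest pair until n_clusters remain."""
    clusters = [[name] for name in data]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: ward_cost(
                [data[n] for n in clusters[p[0]]],
                [data[n] for n in clusters[p[1]]],
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


groups = ward_clusters(profiles, n_clusters=3)
print(groups)
```

In practice you would more likely reach for SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` than hand-roll the merge loop. The sketch also fixes the number of clusters by hand, which is the part a real pipeline has to decide for you.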
The default cut method (hybrid_outlier_cut</code>) determines the number of clusters automatically.</p> This is the part that separates ProteoForge from PeCorA. PeCorA stops at "this peptide is different." ProteoForge goes further: "these three peptides are different in the same way, and they likely belong to the same proteoform."</p> Module 4: Build proteoforms</h3> The naming convention tells the story:</p> Label</th> Meaning</th> Composition</th></tr></thead> dPF_0</code></td> Canonical proteoform</td> All non-discordant peptides. Your "clean" protein signal.</td></tr> dPF_1</code>, dPF_2</code>, ...</td> Differential proteoforms</td> Clusters with 2+ peptides and 1+ significantly discordant.</td></tr> dPF_-1</code></td> Singleton PTM flag</td> Significantly discordant peptides in singleton clusters. Likely single-site modifications.</td></tr> </tbody></table> 💡</span> See Figure 1C.</strong> Three example proteins with their peptide profiles, distance heatmaps, and final dPF assignments. Protein 2 is the interesting one: three peptides cluster into a dPF while a fourth ends up as a singleton PTM.</p> </div> </div> The result is a peptide-to-dPF mapping table. Each protein can have zero, one, or multiple differential proteoforms, plus a canonical form.</p> Benchmarking: what we actually learned</h2> Benchmarking a proteoform discovery tool is tricky because there is no ground truth for proteoforms in most real datasets. We used two approaches: real data with in-silico perturbations (the COPF benchmark approach, using a SWATH-MS interlab HEK293 dataset6</a></sup>7</a></sup>), and controlled simulations.</p> Figure 2: Benchmark results across real and simulated data. (A) SWATH-MS interlab test.
(B-E) Simulations 1-4: imputation impact, missingness tolerance, signal strength, multi-condition complexity.</em></p> Four simulation scenarios tested the things that matter in practice:</p> Scenario</th> What was varied</th> ProteoForge</th> PeCorA</th> Takeaway</th></tr></thead> 1. Imputation impact</td> Up to 35% data removed, imputed</td> MCC dropped 0.005</td> MCC dropped 0.158</td> 30x less sensitive to imputation</td></tr> 2. Missingness tolerance</td> 0% to 80% missing values</td> Stable MCC 0.46-0.57 to 60%</td> Degraded sharply from start</td> Hard cliff at 60% for ProteoForge</td></tr> 3. Signal strength</td> Fold change from low to high</td> 0.91 MCC at high FC</td> 0.52 MCC at high FC</td> Nearly 2x better at strong signal</td></tr> 4. Multi-condition</td> 2 to 6 conditions</td> Flat performance</td> Variable</td> Scales to complex designs</td></tr> </tbody></table> 📝</span> The key visual is Figure 2C.</strong> ProteoForge's MCC line stays flat across increasing missingness while PeCorA's drops linearly. Panels A through E map to the SWATH-MS interlab test and simulations 1 through 4 respectively.</p> </div> </div> For peptide grouping (which PeCorA cannot do), ProteoForge consistently hit MCC around 0.45 to 0.61, while COPF hovered around 0.27. At a p-value cutoff of 0.001, COPF's grouping MCC was near zero in most simulated conditions.</p> One result surprised us. In the imputation simulation, ProteoForge's grouping</em> performance actually improved slightly after imputation (MCC from 0.60 to 0.61). Why? When a peptide is completely missing in one condition and gets imputed with a down-shifted low value, it creates a strong, unambiguous signal. The clustering picks it up correctly.</p> What about AlphaQuant and MSstatsWeightedSummary?</summary> Two other tools address related problems in this space. AlphaQuant uses tree-based quantification to organize peptide data hierarchically and can infer proteoform groups from bottom-up data. 
MSstatsWeightedSummary handles the shared peptide problem through weighted summarization with Huber loss. Both are complementary to ProteoForge rather than direct competitors. ProteoForge focuses specifically on the interaction between missing data, discordant peptide detection, and proteoform grouping. The three tools answer different questions and could be used in combination on the same dataset.</p> </div> </details> The hypoxia application</h2> Benchmarks are one thing. Real biology is another.</p> We applied ProteoForge to a published dataset from Tomin et al. (2025)8</a></sup>, who measured the proteome of H358 lung cancer cells under normoxia (21% O2) and hypoxia (1% O2) at 48 and 72 hours. The data came from DIA-NN9</a></sup> as precursor-level quantities. After filtering and imputation, we had 95,369 peptides across 7,161 proteins.</p> The numbers alone tell a story. At 48 hours, about 15% of proteins had multiple discordant peptides. At 72 hours, that jumped to 36%. Hypoxia is not a subtle perturbation, and the proteoform-level response grows over time in a way that protein-level analysis completely misses. Figure 3A</strong> shows this distribution shift from 48 to 72 hours. The fraction of proteins with multiple significantly discordant peptides more than doubles.</p> The original study by Tomin et al.8</a></sup> reported increased levels of S-lactoylglutathione (a GLO1 product) under hypoxia, but no change in GLO1 protein abundance. That is a contradiction. If the product goes up, something about the enzyme should be different.</p> ProteoForge resolved it.</p> At 48 hours, one peptide in GLO1 (harboring a known phosphorylation site at T35 and a mutagenesis site at Q34) showed elevated intensity under hypoxia while the other peptides stayed flat. By 72 hours, this expanded into a multi-peptide dPF (three peptides, including one with ubiquitination/acetylation at K148).
The protein average saw nothing because the canonical peptides diluted the signal.</p> 💡</span> Figures 3D and 3E</strong> show the GLO1 story in full. 3D has the peptide intensity profiles across conditions. 3E maps each peptide against known PTMs and protein features. The phosphorylation at T35 is particularly interesting: it sits near GLO1's active site, and the discordance emerges before the multi-peptide dPF forms at 72 hours.</p> </div> </div> This is what I mean by "the averaging problem." GLO1 was regulated. The original analysis just could not see it.</p> FPGT: catching isoform-driven variance</h3> Fucose-1-phosphate guanylyltransferase (FPGT) provided a different kind of example. At 48 hours, all peptides behaved the same. At 72 hours, two peptides spanning amino acids 60 to 95 dropped under hypoxia while the rest stayed stable.</p> Figures 3F and 3G</strong> show the FPGT peptide profiles and protein schematic, with the two discordant peptides highlighted in the N-terminal region and isoform boundaries marked.</p> What happens to your pathway analysis</h3> This was, for me, the most striking result.</p> We ran four different protein-level quantification strategies through QuEStVar10</a></sup> (our equivalence testing framework from the earlier paper) and then into pathway enrichment with g:Profiler11</a></sup>:</p> Quantification strategy</th> % Proteins regulated at 72h</th> GO:BP terms found</th></tr></thead> DIANN Original</td> 2%</td> Near zero</td></tr> Mean (top 3 peptides)</td> ~3%</td> Minimal</td></tr> Mean (all peptides)</td> ~4%</td> Minimal</td></tr> ProteoForge dPFs</td> 23%</td> 87 elevated + 181 reduced</td></tr> </tbody></table> That is not a marginal improvement. 
That is the difference between "hypoxia did almost nothing to this proteome" and "hypoxia restructured a quarter of the proteome at the proteoform level."</p> 📝</span> Figures 4C and 5 tell this story visually.</strong> Figure 4C shows the stacked bar charts of regulatory classifications. The contrast between "Original" and "dPFs" bars is startling. Figure 5 shows the downstream pathway enrichment: "Original" finds almost no elevated or reduced GO terms at 72 hours while "dPFs" finds hundreds, including canonical hypoxia response pathways.</p> </div> </div> The pathway results were telling. The original analysis found broad equivalence across GO categories at 72 hours. The dPF analysis found canonical hypoxia response pathways correctly classified as elevated. Cell cycle terms were reduced. Metabolic terms showed complex, time-dependent shifts. Signaling terms appeared elevated.</p> The "Original" analysis missed all of this. Not because the data was bad. Because the quantification averaged it away.</p> Things I wish I had known earlier</h2> A few practical notes for anyone thinking about using ProteoForge.</p> ⚠️</span> The 60% missingness cliff.</strong> ProteoForge handles missing data better than anything else we tested, but there is a hard limit around 60%. Past that, RLM starts treating imputed values as the majority signal rather than outliers, and performance drops sharply. If your dataset has more than 60% missing values, consider using the WLS model with custom imputation weights instead of RLM (see Supplementary Note 1). Or consider whether the dataset is salvageable at all.</p> </div> </div> ⚠️</span> Minimum four peptides per protein.</strong> ProteoForge requires at least four peptides per protein. This is not arbitrary. The interaction model needs enough peptides to estimate the protein-level consensus and detect deviations from it. Proteins with fewer peptides are excluded. This means ProteoForge is best suited for DIA or deep DDA datasets. 
Shallow experiments with 1 to 2 peptides per protein will not benefit.</p> </div> </div> It is not a package yet.</strong> The analysis repository on GitHub</a> contains the scripts and notebooks used for the manuscript. A standalone Python package is under development but not released. If you use the current code, expect some rough edges.</p> Why this matters (to me)</h2> I started this work because of pediatric cancer. My PhD focused on computational proteomics for childhood leukemia, and the proteoform question kept coming up. In our leukemia data, we would see proteins where the overall level looked stable between diagnosis and relapse, but specific peptides shifted dramatically. We could not make sense of it with protein-level tools.</p> ProteoForge came from that frustration. The hypoxia application in the paper is a clean, well-controlled dataset that made for a good proof of concept. But the motivation was always translational: can we find proteoform-level changes in disease that protein-level analysis misses?</p> The GLO1 result gives me confidence that the answer is yes.</p> What comes next</h2> The preprint is currently under review at the Journal of Proteome Research</em>. The standalone Python package is in development.</p> If you want to try it on your data, the analysis repository</a> has everything. The Zenodo snapshot (doi:10.5281/zenodo.17795845</a>) freezes the exact version used in the manuscript.</p> Paper:</strong> Ergin, E. K., Conrrero, A., Ferguson, K. M., Lange, P. F. (2025). ProteoForge: An Imputation-Aware Framework for Differential Proteoform Discovery in Bottom-Up Proteomics. bioRxiv</em>. doi:10.64898/2025.12.12.694008</a></p> Code:</strong> github.com/LangeLab/ProteoForge_Analysis</a></p> Data:</strong> PRIDE repository, PXD062503</a> (Tomin et al. hypoxia dataset)</p> Smith, L.M. et al. (2021). The Human Proteoform Project: Defining the human proteome. Science Advances</em>, 7(46). 
doi:10.1126/sciadv.abk0734</a> ↩</a></p> </li> Smith, L.M. & Kelleher, N.L.; Consortium for Top Down Proteomics (2013). Proteoform: a single term describing protein complexity. Nature Methods</em>, 10(3), 186-187. doi:10.1038/nmeth.2369</a> ↩</a></p> </li> Huber, P.J. (1964). Robust Estimation of a Location Parameter. Annals of Mathematical Statistics</em>, 35(1), 73-101. doi:10.1214/aoms/1177703732</a> ↩</a></p> </li> Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B</em>, 57(1), 289-300. doi:10.1111/j.2517-6161.1995.tb02031.x</a> ↩</a></p> </li> Ward, J.H. Jr. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association</em>, 58(301), 236-244. doi:10.1080/01621459.1963.10500845</a> ↩</a></p> </li> Peptide Correlation Analysis (PeCorA) Reveals Differential Proteoform Regulation (2021). Journal of Proteome Research</em>, 20(4). doi:10.1021/acs.jproteome.0c00602</a> ↩</a></p> </li> Bludau, I. et al. (2021). Systematic detection of functional proteoform groups from bottom-up proteomic datasets. Nature Communications</em>, 12, 3810. doi:10.1038/s41467-021-24030-x</a> ↩</a></p> </li> Tomin, T. et al. (2025). Increased antioxidative defense and reduced advanced glycation end-product formation by metabolic adaptation in non-small-cell-lung-cancer patients. Nature Communications</em>. doi:10.1038/s41467-025-60326-y</a> ↩</a> ↩2</a></p> </li> Demichev, V. et al. (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods</em>, 17, 41-44. doi:10.1038/s41592-019-0638-x</a> ↩</a></p> </li> Ergin, E.K., Myung, J.J.K., Lange, P.F. (2024). Statistical Testing for Protein Equivalence Identifies Core Functional Modules Conserved across 360 Cancer Cell Lines. Journal of Proteome Research</em>, 23(6), 2169-2185. 
doi:10.1021/acs.jproteome.4c00131</a> ↩</a></p> </li> Raudvere, U. et al. (2019). g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research</em>, 47(W1), W191-W198. doi:10.1093/nar/gkz369</a> ↩</a></p> </li> </ol> </section> Workflow Engines and the Case for a Zig-Based One 2025-10-29T00:00:00+00:00 📝</span> TLDR:</strong> I have used Snakemake, Nextflow, and GNU Make to run real proteomics pipelines. All three solve the big problem. Reproducible execution graphs. And all three make the small things harder than they should be. Container overhead, JVM startup, and general-purpose design add friction that a domain-specific engine could remove. This is not a plan. It is a direction I am curious about.</p> </div> </div> I keep coming back to an idea I have not built.</p> A workflow engine for bioinformatics. No runtime. No JVM. No Conda environments to maintain. A single static binary that knows what mzML and FASTA are, because those are the files we actually work with. Copy it to a cluster node and run it.</p> I know how that sounds. The space already has Nextflow, Snakemake, CWL, WDL ◆</span> And Cromwell, Toil, Luigi, Prefect, Airflow, Parsl. The design space is wider than the bioinformatics corner. I name the ones I have used. </span> . They all have real users and real papers. Adding one more sounds like hubris.</p> The idea will not leave me alone. I want to explain why, and where it is probably wrong.</p> The tools I actually use</h2> My first workflow engine was a Makefile. It resolved dependencies, skipped completed steps, and parallelized across cores. For a single-machine analysis with a few dozen samples, it was enough.</p> Make</div>
results/%.mzML: raw/%.raw
	wine msconvert $< --mzML -o $@

results/%.pep.xml: results/%.mzML
	tide-search $< $@

results/%.tsv: results/%.pep.xml
	percolator $< $@</code></pre> CS1: What most of my workflows actually look like.</div> </div> Then the data grew. 
I moved to Snakemake because it understood glob patterns and could distribute to a cluster. I moved to Nextflow because the lab standardized on it.</p> Both are good. Both frustrate me.</p> Snakemake: great rules, heavy environments</h2> Snakemake is the most natural fit for a Python-native bioinformatician. Rules extend Python syntax. The DAG is built before execution so you see what will run before it starts. I liked using it.</p> The pain is the environment layer. Snakemake's conda:</code> directive pins environments per rule. For a pipeline with twenty steps, that is twenty environments to maintain. When one Conda solve hangs or a container pulls the wrong tag, you spend an afternoon fixing infrastructure. The Python runtime is not the problem. What sits on top of it is.</p> I also wonder if per-rule environments are solving the wrong problem. My pipelines depend on maybe four tools plus Python. Twenty Conda environments means I have managed the same numpy install twenty times.</p> Nextflow: channels are good, startup is not</h2> Nextflow's channel model is genuinely good ◆</span> DSL2 channels handle per-sample branching in a way that Snakemake's rule-level scoping does not. When a pipeline step decides what to run next based on the previous output, that matters. </span> . Data flows asynchronously. You branch, merge, and filter without writing explicit orchestration code. For hundreds of samples with per-sample variability, this model makes hard things manageable.</p> The startup cost is what I notice most. Even with NXF_JVM_ARGS</code> tuned, there is a visible delay before the pipeline begins doing useful work. The JVM heap overhead ◆</span> The head process uses hundreds of megabytes. For pipelines running massive tools like MSFragger or DIA-NN, that overhead is negligible. The stronger problem is startup latency and the extra layer of abstraction when debugging. 
</span> is rarely the dominant resource cost in pipelines where the tools themselves use tens of gigabytes.</p> I have debugged a Nextflow pipeline at 2 AM. I do not recommend it.</p> </blockquote> Here is where I go back and forth. The channel model solves a real problem. The startup latency and debugging friction are taxes. Are they coupled? Nextflow does not need the JVM for the channel model. It needs the JVM because it is written in Groovy. That is a historical choice, not a technical necessity.</p> The container tax</h2> Both engines lean on containers for reproducibility. The idea is sound. Modern HPC clusters cache container layers effectively ◆</span> Apptainer and Singularity make this viable: pull once, run hundreds of jobs from the cached copy. Container startup drops dramatically after the first pull. </span> .</p> The friction I feel is less about startup and more about maintenance. Image registries go down. Dependency rebuilds break cached layers. Provenance across container versions requires discipline that a shared cluster rarely enforces. The problem is not "containers are slow." It is "containers add a maintenance surface between me and the science."</p> Containers exist because shared environments drift. That is a real problem and I do not have a better answer. I want reproducibility without managing a container registry on every invocation. I do not know if that tension resolves.</p> What bioinformatics workflows actually need</h2> Here is the honest truth about where the time goes. In most proteomics pipelines, roughly 95% of wall-clock time is in search and quantification tools, 4% is data movement, and 1% is orchestration ◆</span> Even eliminating all JVM and container overhead cuts the 1%, not the 95%. Speed is not the argument for a different engine. The argument is about deployment, debugging, and cognitive load. </span> . 
Eliminating JVM and container overhead changes the total runtime by barely a rounding error.</p> Speed is not the argument. The argument is that the orchestration layer should not add cognitive overhead that exceeds its runtime contribution. A fast, simple engine that I can understand and debug without a specialized DSL is worth having even if the pipeline runs the same wall-clock time.</p> What would a format-aware engine actually do?</h2> This is the question the idea keeps dodging. What does it mean for an engine to know mzML and FASTA?</p> It means the engine can validate inputs before the pipeline starts. A malformed mzML file is caught before it reaches the search tool, not during the fifth hour of a search run. It means the engine can extract metadata (instrument type, precursor tolerance, sample annotations) and propagate it through the DAG without manual wiring. It means the engine can construct parts of the pipeline graph automatically: if the input is mzML and the tool expects FASTA, the engine knows to insert a conversion step.</p> Current engines treat file formats as opaque strings. That is correct for general-purpose tools. For a domain-specific engine, format awareness means the engine understands the data, not just the file paths.</p> I do not know how far this stretches. It might be a small advantage that does not justify a new engine. It might open pipeline patterns that are awkward in current systems. This is where the novelty lives, not in the language choice.</p> Why Zig keeps coming up</h2> I should say this plainly: the language is not the hard part of a workflow engine. The hard parts are error recovery, checkpointing, cloud integration, and provenance tracking. C would work. Rust would work. Go works well and already has excellent concurrency and networking for orchestration workloads ◆</span> Go is understated in most Zig comparisons. It has mature networking, stable concurrency, large ecosystem, and cross-compiles to static binaries. 
Many workflow tools are written in Go for good reason. Zig's advantage over Go for this problem is C interop, and even that matters less if the engine mostly orchestrates external processes. </span> .</p> What keeps me coming back to Zig is the combination of properties.</p> Zig produces a single static binary with no runtime. Deploy the engine with a file copy. No Python environment. No JVM. No Conda.</p> Zig calls C directly ◆</span> @cImport</code> on a C header and the functions are available. No binding layer, no FFI ceremony. ProteoWizard is C++ (needs a C ABI wrapper), but zlib, netCDF, and most compression libraries are straight C. </span> . For reading mzML through existing C libraries and compressing output, that integration path is short.</p> Zig has explicit memory. No garbage collector. For a long-running pipeline on a shared cluster node, predictable memory behavior matters more than peak throughput.</p> These are not unique to Zig. The difference is C interop. Rust needs bindgen</code> and FFI safety wrappers. Go needs cgo. Both work, both add ceremony. For a tool that wraps other tools, ceremony compounds.</p> What I have not figured out</h2> I have not built this. The idea is still a question mark.</p> Zig is pre-1.0 ◆</span> This is the biggest practical concern. A workflow engine is infrastructure software. Infrastructure values stability, ecosystem, and backwards compatibility more than language elegance. Starting an engine on a pre-1.0 language means accepting that every release may break the build. </span> . It will break my code between releases. The ecosystem is tiny compared to anything in the workflow space. If the engine needs a feature that Nextflow already has, the gap is not weeks of work. It is the accumulated years of edge cases that existing engines have already absorbed.</p> I also have not settled the most basic question: should this be a new engine, or a library, or a set of Zig tools that slot into an existing engine? 
A library that generates Makefiles solves deployment without solving the DAG problem. A Zig plugin for Nextflow gives format awareness without the JVM, if that is even possible.</p> Where I land</h2> Here is the honest version. A production workflow engine to replace Nextflow is probably a bad idea. The existing tools have absorbed too many edge cases. Even if I got the architecture right, the gap in real-world testing would take years to close.</p> But building a small one, not to ship but to understand? That is worth doing. The value is not the engine. It is the understanding of what makes workflow orchestration hard, what tradeoffs are real, and where current tools make choices that a domain-specific approach could avoid.</p> I will keep using Nextflow and Snakemake for production work. They work. They have communities. They ship reproducible pipelines that produce real results.</p> And I will probably build something small in this direction. Not to replace Nextflow. To understand the problem space well enough to know whether the idea has anything in it.</p> Why I'm Learning Zig Instead of Doubling Down on Rust 2025-05-25T00:00:00+00:00 📝</span> TLDR:</strong> Zig excels at the boring foundation layer. Small parsers, validators, indexers. Tools with no runtime, explicit memory, direct C interop, and a single static binary. Not replacing Rust for serious projects. Just occupying a different niche.</p> </div> </div> How I got here</h2> I defended my PhD in April 2025. I took a month and a half off, and after that I will start as a staff bioinformatician in the same lab where I did my PhD. Same place, new role, continuing exciting projects that we still have not finalized.</p> I care a lot about performance. I am the kind of person who profiles code that already runs fine, just to see if it can run faster. In academia, what matters most is whether the logic is sound and the output is valid; speed, memory, and overall resource usage come later. 
Unless you are working with extremely large data or in a pipeline where every bit of resource usage counts, optimization is secondary. Well, I spent my time optimizing my tools anyway. That likely made my PhD 1-2 years longer than it should have been. Slow code bothers me more than it probably should.</p> I already knew Rust. I had used it a bit, mostly for small bindings to speed up slow Python loops. It worked well and I respected it. I have definitely seen how it can be a modern toolkit for performant and safe code.</p> During my time off, between gaming and riding the high of finishing my PhD, I learned about Zig's existence. It piqued my interest: it was relatively little known, with a very particular philosophy that resonated with me.</p> What caught me</h2> I am not going to list version numbers or changelogs. The Zig website covers all of that. What I liked was how the language works day to day. I will also mention Rust a lot, since there are obvious similarities and big differences.</p> Zig is explicit. There is no hidden control flow, and nothing allocates memory unless I ask it to. When I read a function, I can see exactly what it does. As someone who spends a lot of time tracking down where an allocation came from, that matters to me. It is also not as heavy or strict as the Rust I know, which strikes a good balance between providing memory safety and staying lightweight.</p> It also works with C directly. Zig treats C as a first-class citizen, which is the phrasing the Zig people like to use, and it fits. As selling phrases go, it is an appropriate one. A lot of bioinformatics runs on C libraries, and Zig can call them without a separate binding layer ◆</span> You can @cImport a C header and the functions just work. No FFI ceremony, no bindgen step, no build system wrestling. </span> . At the end you get one small static binary that you can copy to a cluster and run. 
That is a nice thing to have.</p> Rust is winning. So what?</h2> I want to be fair here, because I am not trying to argue against Rust. It is brilliant. It supports the Python ecosystem in a way I do not think we have seen before. ◆</span> Polars is a great example: DataFrame operations at Rust speed with a clean Python API. </span> </p> Rust is winning in bioinformatics. It is everywhere, and it is being pushed hard for good reasons. The I/O libraries are mature, the format parsers are solid, and even in proteomics there are strong Rust tools in real use. If you were starting a serious project today and asked me what to use, I would tell you Rust.</p> It is also the better career choice, and I know that. Rust is the thing on the resume and the keyword in the job post. Picking it is the sensible decision.</p> But where is the fun in that?</p> The boring tools niche</h2> I should be clear about one thing: I have not built anything real in Zig yet. I am writing down my thoughts at a very early stage, while I am still learning a language I decided to invest in, so some bias should be expected.</p> Everything I say about Zig is based on what I have seen so far and a personal hunch. But my hunch is specific. The work I keep coming back to is the boring foundation. Parsers, indexers, validators. Small tools that hold a pipeline together and cause real problems when they break. They tend to be small, they stick around for years, and they have to be fast and predictable on whatever machine the data is on.</p> That is what Zig looks good for. Small, no runtime, explicit about memory, and easy to use with existing C code. For that kind of tool, it feels like a good fit rather than a compromise.</p> And it is fun to write. It makes me want to build things on my own time again, which I had not felt much during the PhD.</p> The honest caveats</h2> I have to be honest about the caveats, as most early Zig adopters do. 
It goes without saying that a pre-1.0 language will break your code between releases. That is the plan, not an accident, so if you build on it now you should expect to do migrations.</p> The ecosystem is small compared to Rust. Fewer libraries, fewer examples, and fewer people who have already solved your problem. Sometimes you have to work it out yourself. There is also no bioinformatics presence, a gap I might work on closing in the future.</p> So I am not telling anyone to skip Rust and bet their career on Zig. Learn Rust and use Rust. It is the smart choice. I just want to spend some of my own time on Zig because I think I will enjoy it, and I think it might see good adoption in the future.</p> Where I land</h2> I think Zig is a good fit for the foundational tools in bioinformatics, the small fast pieces everything else sits on. I cannot prove that yet, because I have not built anything yet.</p> For now, I will use Rust when the work calls for it, and Zig when I get to choose.</p> proDA 2020-05-05T00:00:00+00:00 Note:</strong> This reading note was originally written in 2020 when I first encountered the preprint. I have expanded and updated it in June 2026 with proper citation details (the paper was published in Journal of Proteome Research</em> after the initial bioRxiv version) and verified that the proDA package remains actively maintained on Bioconductor.</p> </blockquote> 📝</span> Paper:</strong> Probabilistic dropout analysis for identifying differentially abundant proteins in label-free mass spectrometry</a> Authors:</strong> Constantin Ahlmann-Eltze, Simon Anders Published:</strong> bioRxiv (2019), later published in Journal of Proteome Research</em> (2020) ◆</span> I originally read the bioRxiv preprint, but if you are citing this in a paper, use the JPR version. 
— EKE, June 2026 </span></p> Tool:</strong> proDA R package</a> | GitHub</a> ◆</span> As of June 2026, the GitHub repo shows recent commits and the Bioconductor package is in active release (version 1.26.0). This is not abandonware. The method has held up and the implementation is maintained. — EKE, June 2026 </span></p> Citation:</strong> Ahlmann-Eltze, C. & Anders, S. proDA: Probabilistic Dropout Analysis for Identifying Differentially Abundant Proteins in Label-Free Mass Spectrometry. J. Proteome Res.</em> 19</strong>, 1761–1774 (2020).</p> </div> </div> Why this one stuck</h2> proDA offers a way to run differential abundance tests in the presence of missing values. Missing values in label-free data are not always random. You miss the low-intensity stuff systematically. Drop in a median or some small constant and you've quietly invented data points. Your matrix looks complete. Your downstream code runs without complaints. Your p-values are confident. They're also biased.</p> proDA refuses the shortcut. Instead of filling holes, it fits a sigmoidal dropout curve per sample. If a value is missing, the model says "this protein was probably below detection, here's the probability distribution for where it might have been." Then it carries that uncertainty through to the differential abundance test. The error bars grow. The significant hits shrink. The ones that survive are real.</p> proDA's probabilistic dropout model showing how missing values are modeled as censored observations below detection threshold</div> Figure</div> </div> </div> The dropout model fits a curve describing missingness probability as a function of intensity. Proteins with all missing values (blue) have wide uncertainty. Proteins with some observations (orange, green) combine observed data with modeled dropout probability. From Ahlmann-Eltze & Anders, 2019.</p> </figcaption> </div> </figure> What you actually get</h2> The practical difference is honest uncertainty. 
If you have a protein with 5 missing values and 1 weak observation, traditional imputation fills those 5 holes with guesses and pretends you measured 6 data points. proDA says "you measured one marginal signal, the rest fell below detection, here's how uncertain your mean estimate actually is."</p> Your list of significantly differentially abundant proteins gets shorter. That's the point. The proteins you lose were riding on invented data. The ones you keep passed a test that acknowledges how little you actually saw.</p> When it matters</h2> This matters most when you're running differential abundance tests and your FDR depends on accurate variance estimates. If you're just making a heatmap for a figure or doing quick exploratory clustering, imputation is fine. The problem is when you impute for visualization and then forget you did it before running statistics.</p> 📝</span> Final thought:</strong> proDA is a reminder that the data we don't see can be just as important as the data we do. Since 2020, many methods have been developed around this idea. I myself have been using limma with weighted testing and robust linear models to get around imputing missing values. Imputation can give us a false sense of confidence. Modeling missingness explicitly keeps us honest about what our data can actually tell us. — EKE, June 2026</p> </div> </div> Prosit 2019-07-15T00:00:00+00:00 Note:</strong> I first wrote this note in the summer of 2019, right after the Nature Methods paper dropped. When I rebuilt my website and migrated the posts in June 2026, I had an opportunity to edit it, adding proper citation details and a check on where the models live now, since the original code repository was retired in favor of newer tooling.</p> </blockquote> 📝</span> Paper:</strong> Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning</a> Authors:</strong> Siegfried Gessulat, Tobias Schmidt, Daniel P. 
Zolg, and colleagues (Mathias Wilhelm and Bernhard Kuster labs) Published:</strong> Nature Methods</em> (2019), published online 27 May 2019 ◆</span> The Nature Methods version is the one to cite. The preprint history is messy but the final paper is solid. — EKE, June 2026 </span></p> Tool:</strong> Prosit on ProteomicsDB</a> | Koina model server</a> ◆</span> The original kusterlab/prosit GitHub repository was archived in August 2023. The models moved to Koina, the training code moved to dlomix, and the rescoring/library generation moved to Oktoberfest. The method did not die, it grew up and got integrated into real workflows. — EKE, June 2026 </span></p> Citation:</strong> Gessulat, S. et al.</em> Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods</em> 16</strong>, 509–518 (2019).</p> </div> </div> Why this one stuck</h2> I had filed spectral prediction under "neat demo" and left it there. Prosit moved it off that shelf. It predicts fragment intensities and indexed retention time from peptide sequence alone, and the predictions are good enough to feed a rescorer or build a DIA library without measuring a single real spectrum.</p> The shift was not in the math. Predicting spectra from sequence is a straightforward supervised learning problem if you have enough training data. The shift was in the accuracy. The predictions crossed the threshold where people started using them in production pipelines instead of publishing them and moving on.</p> Mirror plot comparing a Prosit-predicted MS2 spectrum against an observed spectrum, showing close agreement of fragment intensities</div> Figure</div> </div> </div> Predicted fragment intensities sit close enough to observed spectra to be used as features for rescoring and as the basis for spectral libraries. From Gessulat et al., 2019.</p> </figcaption> </div> </figure> What you actually get</h2> The practical impact shows up in two places. First, rescoring gets better. 
When you give a rescorer predicted spectra as features, it can actually distinguish good matches from noise instead of gambling on search engine scores alone. Second, you can generate spectral libraries for organisms or proteases nobody has measured. Just predict from sequence.</p> The interesting part is the architecture. It is not complicated. A bidirectional LSTM with attention, nothing exotic. The win came from scale. Enough training data, enough compute, and suddenly predictions that were too noisy to trust become informative.</p> When it matters</h2> This matters when you need spectral libraries for something outside the standard human/mouse/yeast space, or when rescoring database searches. It matters less if your peptides sit far from the training set. Early models struggle with unusual modifications or nontryptic cleavage. The newer models on Koina handle more, but the boundaries are still real.</p> The other place it matters is as proof of concept. Prosit showed that deep learning could do more in proteomics than classify spectra. It could generate them. That opened doors. Since 2019, whole workflows have been built around predicted spectra. The method did not stay in papers. It moved into tools people actually run.</p> 📝</span> Final thought:</strong> I never used Prosit directly in my own work. What stuck with me was watching spectral prediction go from "interesting idea" to "thing people depend on." The math is not mysterious. The training data and engineering are what made it work. It is a good example of how deep learning earns its place in proteomics: not by being clever, but by being useful enough that the field builds around it. — EKE, June 2026</p> </div> </div> Fuzzy Clustering 2018-05-01T00:00:00+00:00 Note:</strong> This post was originally published on my old blog on 2018-05-01 and has been transferred here. I have rewritten parts of the original article for clarity and style while keeping the main story and facts intact. 
Where my current self disagreed with or wanted to expand on the original, I added margin notes signed — Eke, May 2026</em>.</p> </blockquote> I first ran into fuzzy clustering during a machine learning course in my undergrad. The idea that a single data point could belong to multiple clusters at once felt wrong ◆</span> Eight years later, this still makes me smile. The friction I felt then is exactly what makes fuzzy clustering interesting. If the answer were obvious, you would not need an algorithm. What I did not appreciate at the time was how naturally this maps onto Bayesian thinking: the membership coefficients are essentially posterior probabilities over cluster assignments. — EKE, May 2026 </span> . Either something is in a group or it is not, right?</p> Turns out, the world is rarely that clean.</p> Think about a customer who buys both hiking gear and cooking equipment. Do you put them in the "outdoors" segment or the "food" segment? A hard clustering algorithm forces you to pick one. Fuzzy clustering says: they are 60% outdoors, 40% food. That is more useful for most real problems ◆</span> Customer segmentation was the go-to example in every 2018 ML tutorial, and I was no exception. The example is fine but it hides a subtle point: the membership coefficients are only as good as the feature space they live in. If your features do not separate the underlying behaviors, the coefficients will look uniform and tell you nothing. I have seen plenty of projects where fuzzy clustering gave near-uniform 1/C membership across all points, and the team concluded the algorithm did not work. The algorithm was fine. The features were the problem. 
— EKE, May 2026 </span> .</p> This post walks through what fuzzy clustering is, how the Fuzzy C-Means algorithm works under the hood, and how to implement it in R and Python.</p> Clustering in two flavours</h2> Cluster analysis is the task of grouping objects so that objects in the same group are more similar to each other than to objects in other groups1</a></sup>. There are dozens of algorithms, but they fall into two broad categories:</p> Hard clustering:</strong> every point belongs to exactly one cluster. K-means, hierarchical clustering, DBSCAN all work this way.</li> Soft (fuzzy) clustering:</strong> every point has a membership coefficient for each cluster, ranging from 0 to 1. The coefficients sum to 1 across clusters.</li> </ul> Hard clustering is simpler and faster. Fuzzy clustering is more expressive ◆</span> This hard/soft binary is itself a simplification. DBSCAN has noise points that belong to no cluster, hierarchical clustering has merges that are fuzzy until you cut the dendrogram, and spectral clustering operates in an embedding where distances themselves are soft. The lines are blurrier than I made them sound here. If I were writing this today I would frame it as a spectrum of assignment granularity rather than a hard dichotomy. — EKE, May 2026 </span> . Which one you use depends on whether your data has clear boundaries or graded transitions.</p> Fuzzy C-Means</h2> The Fuzzy C-Means (FCM) algorithm, developed by Dunn in 1973 and improved by Bezdek in 19812</a></sup>, is the fuzzy counterpart to K-means. The core idea is the same: find cluster centers and assign points to them. 
But instead of a hard assignment, each point gets a membership value for every cluster.</p> The objective function</h3> FCM minimizes the following:</p> $$ J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \lVert x_i - c_j \rVert^2 $$</p> where:</p> $N$ is the number of data points</li> $C$ is the number of clusters</li> $u_{ij}$ is the membership of point $i$ in cluster $j$</li> $m$ is the fuzziness parameter ($m > 1$), controlling how soft the boundaries are</li> $x_i$ is the $i$-th data point</li> $c_j$ is the center of cluster $j$</li> </ul> Higher $m$ means fuzzier clusters. Standard practice uses $m = 2$3</a></sup> ◆</span> One thing I glossed over completely: the choice of $m$ is not arbitrary, and $m = 2$ is not always optimal. Small $m$ (close to 1) makes FCM behave like K-means with near-binary assignments. Large $m$ (above 3) flattens memberships toward uniformity and can make clusters indistinguishable. There is literature on optimizing $m$ via cluster validity indices, but in practice most people pick 2 because Bezdek said so. I have seen datasets where $m = 1.5$ gave much cleaner separation, and others where $m = 3$ was needed to avoid degenerate solutions. Experiment. — EKE, May 2026 </span> .</p> The algorithm</h3> FCM iterates between two updates until convergence:</p> 1. Update membership coefficients:</strong></p> $$ u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} $$</p> This says: if a point is close to cluster $j$ relative to other clusters, its membership in $j$ will be high.</p> 2. Update cluster centers:</strong></p> $$ c_j = \frac{\sum_{i=1}^{N} u_{ij}^m x_i}{\sum_{i=1}^{N} u_{ij}^m} $$</p> Each cluster center is a weighted average of all points, weighted by the $m$-th power of their membership in that cluster.</p> The algorithm repeats these steps until the objective function stops changing (or changes less than some tolerance)4</a></sup> ◆</span> A real issue I did not mention: local minima. 
FCM, like K-means, is sensitive to initialization. Different random starts can produce different clusterings, especially with higher $m$ or many clusters. The standard fix is to run the algorithm multiple times with different seeds and keep the run with the lowest $J_m$, but that adds computation. The skfuzzy</code> implementation does not do this automatically, and fanny()</code> in R has limited support for it. If reproducibility matters, set a seed and report it. — EKE, May 2026 </span> .</p> 📝</span> K-means is a special case of FCM.</strong> If you set $m \to 1$, the memberships become binary and FCM collapses into K-means. In practice, FCM with $m = 2$ is the standard choice.</p> </div> </div> What the output looks like</h3> After convergence, each point has a vector of $C$ membership values. A point near the core of a cluster might have membership $[0.95, 0.03, 0.02]$. A point on the boundary between two clusters might have $[0.45, 0.50, 0.05]$ ◆</span> I used to treat these vectors as pure assignment probabilities. They are not. The membership $u_{ij}$ depends on the position of all cluster centers, not just the distance to cluster $j$. If cluster centers shift because of a faraway group, the membership of a point that has not moved can change. This makes temporal or cross-dataset comparisons of membership values tricky unless the cluster centers are aligned first. — EKE, May 2026 </span> .</p> When would you use fuzzy over hard clustering?</summary> Fuzzy clustering shines when cluster boundaries are not sharp. Examples include image segmentation (a pixel can be part sky and part tree), customer segmentation (people have mixed interests), and biological data where expression states grade into each other. 
Use hard clustering when you need categorical assignments or when your data has natural, well-separated groups.</p> </div> </details> Fuzzy C-Means in R</h2> The cluster</code> package has fanny()</code> for fuzzy clustering ◆</span> fanny()</code> is fine for basic use, but it has limitations. It cannot handle large datasets well (the distance matrix grows quadratically), and it exposes the fuzziness exponent as memb.exp</code> rather than under the name $m$. The ppclust</code> and fclust</code> packages offer more modern FCM implementations with better initialization options, but fanny()</code> remains the most battle-tested. If I were writing this section today, I would also mention the clustMixType</code> package for mixed-type data, which is a common real-world scenario. — EKE, May 2026 </span> . Let us run it on the Iris dataset.</p> R</div>
library(cluster)
library(factoextra)
library(tidyverse)

# Build unique row names such as "setosa-1" from the species and row index.
# column_to_rownames() already drops the "species" column, so no select() is needed.
iris_df <- iris %>%
  mutate(spec_idx = row_number()) %>%
  unite("species", Species, spec_idx, sep = "-", remove = TRUE) %>%
  column_to_rownames("species")

# Fuzzy clustering with 3 clusters
res.fanny <- fanny(iris_df, 3)
head(res.fanny$membership, 7)</code></pre> CS1: FCM on the Iris dataset using the cluster package.</div> </div> Output:</p> Code</div>
              [,1]       [,2]       [,3]
setosa-1 0.9115847 0.03714162 0.05127368
setosa-2 0.8641378 0.05659841 0.07926381
setosa-3 0.8720433 0.05381542 0.07414133
setosa-4 0.8459146 0.06419306 0.08989232
setosa-5 0.9001651 0.04205859 0.05777633
setosa-6 0.7648869 0.09848692 0.13662620
setosa-7 0.8601062 0.05878600 0.08110779</code></pre> </div> The setosa points all have memberships above 0.76 in cluster 1. Clear assignment. The versicolor and virginica points will show more spread ◆</span> A detail I skipped: look at setosa-6. Its membership in cluster 1 is 0.76, noticeably lower than the others. This is a real effect, not noise.
Some individual Iris plants in the Fisher dataset have petal/sepal measurements that push them toward the versicolor boundary. If I were using these memberships downstream, I would flag setosa-6 as a borderline case worth inspecting. Membership coefficients are diagnostic tools, not just output. — EKE, May 2026 </span> .</p> R</div>
fviz_cluster(res.fanny,
             ellipse.type = "convex",
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             ggtheme = theme_minimal(),
             legend = "right")</code></pre> CS2: Visualise the fuzzy clusters.</div> </div> Iris plot</div> Figure 1</div> </div> </div> Figure 1.</strong> Fuzzy cluster plot for the Iris dataset.</p> </figcaption> </div> </figure> Setosa forms a tight cluster on the left. Versicolor and virginica overlap in the middle. That overlap is exactly what the membership coefficients capture. The silhouette plot tells a similar story:</p> R</div>
fviz_silhouette(res.fanny,
                palette = c("#00AFBB", "#E7B800", "#FC4E07"),
                ggtheme = theme_minimal())</code></pre> CS3: Silhouette plot for cluster quality.</div> </div> Silhouette</div> Figure 2</div> </div> </div> Figure 2.</strong> Silhouette plot for the fuzzy clustering result.</p> </figcaption> </div> </figure> The average silhouette width of 0.42 is decent. Most points fit their assigned clusters reasonably well, but the overlap between versicolor and virginica pulls the average down.</p> Fuzzy C-Means in Python</h2> Python does not have FCM in scikit-learn ◆</span> In 2026 this is still true. scikit-learn has never added fuzzy clustering to its core API. The maintainers have discussed it multiple times on GitHub issues and always punted, mainly because the demand is low relative to maintenance cost. scikit-fuzzy</code> has been the de facto standard, but it has not seen a major release in years. If you need something production-ready with modern Python support, consider fuzzy-c-means</code> (PyPI) or implementing the update equations yourself in 30 lines of numpy.
The algorithm is simple enough that a custom implementation is often cleaner than wrangling an unmaintained dependency. — EKE, May 2026 </span> . The scikit-fuzzy</code> (skfuzzy</code>) library fills the gap5</a></sup>. Let us generate synthetic data with three known clusters and see how FCM recovers them.</p> Python</div>
import numpy as np
import skfuzzy as fuzz
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
np.random.seed(42)

# Three cluster centers with per-axis standard deviations
centers = [[1, 3], [2, 2], [3, 8]]
sigmas = [[0.3, 0.5], [0.5, 0.3], [0.5, 0.3]]

# Draw 200 points per cluster
xpts, ypts = np.array([]), np.array([])
for (xmu, ymu), (xsigma, ysigma) in zip(centers, sigmas):
    xpts = np.append(xpts, np.random.normal(xmu, xsigma, 200))
    ypts = np.append(ypts, np.random.normal(ymu, ysigma, 200))

plt.figure(figsize=(8, 6))
plt.scatter(xpts, ypts, c=["b"]*200 + ["orange"]*200 + ["g"]*200, s=10)
plt.title("Test data: 600 points, 3 clusters")
plt.show()</code></pre> CS4: Generate test data with three cluster centers.</div> </div> Test data</div> Figure 3</div> </div> </div> Figure 3.</strong> Synthetic test data with three visible clusters.</p> </figcaption> </div> </figure> Three visible clusters. The question is how many clusters FCM finds on its own. The Fuzzy Partition Coefficient (FPC) tells us. FPC ranges from 0 to 1, with 1 meaning perfectly separated clusters ◆</span> FPC and its close relative the Normalized Fuzzy Partition Coefficient (NFPC) are useful heuristics, but they have a well-known bias: they favour compact, spherical clusters with similar sizes. If your data has elongated clusters, varying densities, or very different sizes, FPC will mislead you. There are alternatives: the Xie-Beni index, the Fukuyama-Sugeno index, and the silhouette width (which works for fuzzy assignments too). I did not know about these in 2018 and relied on FPC alone. Do not make the same mistake. — EKE, May 2026 </span> .
Let us fit models with 2 through 10 clusters and compare.</p> Python</div>
alldata = np.vstack((xpts, ypts))  # skfuzzy expects shape (features, samples)
fpcs = []

fig, axes = plt.subplots(3, 3, figsize=(10, 10))
# Ten colors are needed because the sweep goes up to 10 clusters
colors = ["b", "orange", "g", "r", "c", "m", "y", "k", "Brown", "ForestGreen"]

for ncenters, ax in enumerate(axes.ravel(), start=2):
    cntr, u, _, _, _, _, fpc = fuzz.cluster.cmeans(
        alldata, ncenters, 2, error=0.005, maxiter=1000, init=None
    )
    fpcs.append(fpc)

    # Harden the memberships only for plotting
    cluster_membership = np.argmax(u, axis=0)
    for j in range(ncenters):
        ax.scatter(xpts[cluster_membership == j], ypts[cluster_membership == j], c=colors[j], s=8)

    # Mark the fitted cluster centers
    ax.scatter(cntr[:, 0], cntr[:, 1], marker="s", c="red", s=60)
    ax.set_title(f"Centers = {ncenters}, FPC = {fpc:.2f}")
    ax.axis("off")

plt.tight_layout()
plt.show()</code></pre> CS5: Evaluate FPC across different cluster counts.</div> </div> Cluster sweep</div> Figure 4</div> </div> </div> Figure 4.</strong> Cluster comparison across different numbers of centers with FPC values.</p> </figcaption> </div> </figure> The FPC peaks at 2 clusters, not 3. That is unexpected. Let us check the FPC values directly:</p> Python</div>
plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), fpcs, "o-", color="#731810")
plt.xlabel("Number of clusters")
plt.ylabel("Fuzzy Partition Coefficient")
plt.title("FPC vs Number of Clusters")
plt.show()</code></pre> CS6: Plot FPC against number of clusters.</div> </div> FPC curve</div> Figure 5</div> </div> </div> Figure 5.</strong> Fuzzy Partition Coefficient as a function of cluster count.</p> </figcaption> </div> </figure> FPC peaks at 2 clusters (0.82) and drops steadily after that. Why would a dataset with three real clusters have a higher FPC at 2?</p> Look at the data again. The two left clusters (centers at [1,3] and [2,2]) are close together. FCM with 2 clusters merges them into one and keeps the distant third cluster (center at [3,8]) separate. The resulting partition is cleaner in the FPC sense because the merged cluster is still compact.
FPC penalises overlap, and the two left clusters overlap significantly.</p> 💡</span> FPC is not the ground truth.</strong> It tells you how clean your partition is, not whether it matches reality. Always pair FPC with domain knowledge and visual inspection. If you know your data has three meaningful groups, three clusters is the right answer regardless of what FPC says.</p> </div> </div> Closing thoughts</h2> Fuzzy clustering is not a replacement for hard clustering. It is a different tool for a different kind of problem. Use it when your data has graded boundaries, when a point can reasonably belong to multiple groups, or when you need graded memberships instead of hard labels ◆</span> If I were writing this post today, I would add a third use case: diagnostic tool. The membership distribution across points can tell you things about your data that hard assignments hide. High-entropy membership vectors (where no cluster gets above 0.5) are a strong signal that your data does not cluster well, your feature space is poorly chosen, or the number of clusters is wrong. A hard clustering algorithm will still assign every point to some cluster and give you confident-looking labels. Fuzzy clustering forces the ambiguity into the open. That alone is worth the price of entry. — EKE, May 2026 </span> .</p> The FCM algorithm itself is simple, well-studied, and implemented in both R and Python6</a></sup>. Start with $m = 2$, validate with FPC and visual inspection, and treat the membership coefficients as the rich information they are ◆</span> If there is one thing I want readers to take away from this 2026 annotation, it is this: the membership coefficients are not the final answer. They are the beginning of the analysis. Plot their distributions. Check for high-entropy points. Compare them across runs with different random seeds. Cluster the membership vectors themselves to see if there are meta-clusters of points with similar assignment profiles.
The membership matrix often contains structure that is invisible in the original feature space. I missed all of this in 2018. I hope you do not. — EKE, May 2026 </span> .</p> The classic definition from Kaufman and Rousseeuw (1990), Finding Groups in Data</em>. ↩</a></p> </li> Dunn's 1973 paper introduced the fuzzy ISODATA algorithm; Bezdek generalised it into FCM in 1981. ↩</a></p> </li> Schwammle, V. & Jensen, O.N. (2010). A simple and fast method to determine the parameters for fuzzy c-means cluster analysis. Bioinformatics</em>, 26(22), 2841-2848. doi:10.1093/bioinformatics/btq534</a>. They propose a method to choose $m$ and $C$ simultaneously by finding the point where clustering on randomised data no longer detects structure. ↩</a></p> </li> Matteucci, M. A Tutorial on Clustering Algorithms - Fuzzy C-Means</em>. Available at matteucci.faculty.polimi.it</a>. A clear, visual walkthrough of the FCM algorithm with interactive demos. ↩</a></p> </li> Scikit-Fuzzy documentation. Fuzzy c-means clustering</em>. Available at pythonhosted.org/scikit-fuzzy</a>. The official example gallery for the skfuzzy library. ↩</a></p> </li> Doring, C., Lesot, M.-J. & Kruse, R. (2006). Data analysis with fuzzy clustering methods. Computational Statistics & Data Analysis</em>, 51(1), 192-214. doi:10.1016/j.csda.2006.04.030</a>. A comprehensive survey of the fuzzy clustering landscape, covering objective function methods, ACE, and FMLE. ↩</a></p> </li> </ol> </section>
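<p>As a supplement to the two update equations in the algorithm section, here is a minimal NumPy sketch of FCM. It is an illustration under stated assumptions, not the skfuzzy implementation: the names <code>fcm</code> and <code>fuzzy_partition_coefficient</code>, the random membership initialisation, the tolerance, and the synthetic blobs are all hypothetical choices made for this example. The FPC is computed as the mean over points of the summed squared memberships, which is how the coefficient is commonly defined.</p>

```python
import numpy as np


def fcm(X, n_clusters, m=2.0, tol=1e-6, max_iter=300, seed=0):
    """Illustrative Fuzzy C-Means. X has shape (N, d); returns (centers, U)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalised so each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    prev_J = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Center update: mean of all points weighted by u_ij^m.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared Euclidean distances, shape (N, C); the floor avoids 0-division.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)
        # Objective J_m for the current memberships and the updated centers.
        J = float((Um * d2).sum())
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Stop once the objective changes by less than the tolerance.
        if abs(prev_J - J) < tol:
            break
        prev_J = J
    return centers, U


def fuzzy_partition_coefficient(U):
    """FPC: mean over points of the sum of squared memberships."""
    return float((U ** 2).sum() / U.shape[0])


if __name__ == "__main__":
    # Hypothetical test data: three well-separated Gaussian blobs.
    rng = np.random.default_rng(42)
    blobs = [rng.normal(loc, 0.3, size=(100, 2)) for loc in ([0, 0], [5, 5], [10, 0])]
    X = np.vstack(blobs)
    centers, U = fcm(X, n_clusters=3)
    print(centers.round(2))
    print(round(fuzzy_partition_coefficient(U), 3))
```

<p>Note that this sketch stores memberships as an (N, C) matrix with one row per point, whereas <code>skfuzzy</code> returns the transpose. Because the result depends on the random start, it is worth rerunning with several values of <code>seed</code> and keeping the run with the lowest objective.</p>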