The beginning: choosing the foundation
When we wrote the first line of Crezaro code, we had a choice that every technical founder faces: build for where you are or build for where you want to be. We chose a middle path. We picked technologies and patterns that would scale but did not over-engineer the initial implementation.
The core platform runs on Laravel with PHP 8.5 and FrankenPHP (via Laravel Octane). This gives us the developer productivity of Laravel with the performance of a persistent-process server. Our API response times average 45ms for read operations and 120ms for write operations.
For the payment-critical hot path, we use Go. The payment engine, webhook dispatcher, and reconciliation service are all Go microservices that communicate via gRPC internally and consume events from Kafka. This separation means the payment engine can process thousands of transactions per second (TPS) independently of the main application.
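The consuming side of that split can be sketched as a small event router: each Go service subscribes to the payment topics it cares about and fans events out by type. The event envelope and type names below are illustrative, not Crezaro's actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PaymentEvent is a hypothetical envelope for payment lifecycle
// events on the Kafka bus. Field names are illustrative only.
type PaymentEvent struct {
	ID      string          `json:"id"`
	Type    string          `json:"type"` // e.g. "payment.succeeded"
	Payload json.RawMessage `json:"payload"`
}

// dispatch routes a raw event to the service interested in it.
func dispatch(raw []byte) (string, error) {
	var ev PaymentEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "", err
	}
	switch ev.Type {
	case "payment.succeeded", "payment.failed":
		return "webhook-dispatcher", nil
	case "settlement.posted":
		return "reconciliation", nil
	default:
		return "dead-letter", nil
	}
}

func main() {
	svc, _ := dispatch([]byte(`{"id":"evt_1","type":"payment.succeeded","payload":{}}`))
	fmt.Println(svc) // payment events go to the webhook dispatcher
}
```

In a real deployment the `raw` bytes would come from a Kafka consumer loop rather than a literal, and unknown event types would land on a dead-letter topic for inspection.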
Database: PostgreSQL, no compromises
We evaluated every database you can name: MySQL, CockroachDB, YugabyteDB, even DynamoDB. We chose PostgreSQL 18 and have not regretted it for a moment. Here is why:
- SERIALIZABLE isolation: Critical for financial transactions. PostgreSQL's Serializable Snapshot Isolation (SSI) detects dangerous read/write dependencies optimistically, giving true serializability without pervasive locking, and its implementation is the best in the industry.
- JSONB: Metadata, processor responses, and webhook payloads are stored as JSONB columns. This gives us the flexibility of a document store with the reliability of a relational database.
- Partitioning: Our transaction and ledger tables are partitioned by month. This keeps query performance consistent as data grows and makes archival straightforward.
- Extensions: pg_stat_statements for query analysis, pgcrypto for column-level encryption of sensitive fields, and pg_partman for automated partition management.
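One practical consequence of running financial writes under SERIALIZABLE is that the application must expect and retry serialization failures (SQLSTATE 40001), which PostgreSQL raises when concurrent transactions would produce an anomaly. A minimal retry wrapper, with the database call abstracted behind a function (the names and error sentinel are hypothetical; a real version would inspect the SQLSTATE on a `*sql.Tx` error):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSerialization stands in for PostgreSQL's SQLSTATE 40001
// ("serialization_failure"), which SERIALIZABLE clients must
// be prepared to receive and retry.
var ErrSerialization = errors.New("serialization_failure (40001)")

// withSerializableRetry runs fn up to maxRetries times, retrying
// only on serialization failures; any other error aborts at once.
func withSerializableRetry(maxRetries int, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		err = fn()
		if err == nil || !errors.Is(err, ErrSerialization) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d retries: %w", maxRetries, err)
}

func main() {
	attempts := 0
	err := withSerializableRetry(5, func() error {
		attempts++
		if attempts < 3 {
			return ErrSerialization // first two attempts conflict
		}
		return nil // third attempt commits cleanly
	})
	fmt.Println(attempts, err) // 3 <nil>
}
```

The key design point is that only 40001 triggers a retry: business-logic errors must surface immediately, or a failed payment could be silently re-attempted.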
The infrastructure stack
Our production infrastructure runs on a mix of bare-metal servers (for predictable latency) and cloud services (for elasticity):
- Compute: Dedicated servers for the database and payment engine, with auto-scaling cloud instances for the API and background workers
- Redis 7: Caching, rate limiting, session storage, and queue management. We run Redis in cluster mode with six nodes.
- Apache Kafka 3.9: Event bus for all payment lifecycle events. Three-broker cluster with topic-level replication.
- ClickHouse: Analytics and reporting. Transaction data is streamed from Kafka to ClickHouse for real-time dashboards.
- Meilisearch: Full-text search across transactions, customers, and logs.
Monitoring: the unglamorous necessity
We built our monitoring infrastructure before we built most of the product. This is counterintuitive but necessary for a payment system. When something goes wrong with money, you need to know in seconds, not minutes.
Every transaction emits 47 distinct metrics, including processing time by stage, processor response codes, fraud score distributions, and settlement accuracy. We use Prometheus for metrics collection, Grafana for visualization, and PagerDuty for alerting.
Our alert thresholds are aggressive. If the p95 payment processing time exceeds 500ms for more than 60 seconds, someone gets paged. If the success rate for any processor drops below 95%, someone gets paged. If the ledger balance check fails for any account, someone gets paged immediately.
The best time to build monitoring is before you need it. The second-best time is now.
Mistakes we made (and fixed)
We are not going to pretend the journey was smooth. Here are three significant mistakes we made and how we recovered:
1. Synchronous webhook delivery. Our first webhook implementation delivered webhooks synchronously during payment processing. This meant a slow merchant endpoint could delay the payment response. We moved to asynchronous delivery via Kafka within the first month of production.
2. Monolithic deployment. We initially deployed everything (API, admin dashboard, payment engine) as a single application. The first time we needed to scale the payment engine independently, we realized this was not going to work. We extracted the Go services over a three-week sprint.
3. Insufficient load testing. Our initial load tests simulated 100 TPS. Our first real traffic spike hit 800 TPS. The system handled it, but barely. We now load test to 10x our expected peak regularly.
Where we are today
Crezaro sustains 5,000+ TPS with 99.99% uptime. Our p99 API latency is under 200ms. We have processed billions of naira with zero ledger discrepancies.
But we are far from done. We are currently working on multi-region deployment for geographic redundancy, edge caching for our checkout pages, and a real-time streaming API for large merchants who need instant transaction notifications.
If you are a fellow engineer who enjoys this kind of problem, we are hiring. And if you want to build on top of infrastructure that handles this complexity, our APIs are ready.