Technical Architect's Field Manual
Indrajith's — Reference Documents — April 2026
A practical guide covering solution design, non-functional requirements, governance, cloud readiness, estimation, and GenAI architecture.
Preface
This book is written for a Technical Architect who needs to be prepared, not just informed. Being informed means you have read about architectural concepts. Being prepared means you can walk into a room, face a real problem, and produce a structured, defensible approach under pressure.
Every module follows the same structure: why it matters, the core concepts explained plainly, real-world case studies, and practice exercises. Read a module, do the exercise, then move on. Do not read this cover to cover in one sitting.
Note: This book references Robert C. Martin's Clean Architecture throughout relevant modules. These references are not required reading. This book stands on its own.
What Makes a Good Technical Architect?
- They think in trade-offs, not solutions. There is no perfect architecture. Every decision gives something up to gain something else.
- They communicate at the right altitude. To an executive: cost and timeline impact. To a developer: patterns and interfaces. Same decision, different altitude.
- They make the implicit explicit. The most dangerous architectural decisions are the ones nobody realised were decisions.
- They are comfortable with uncertainty. Architecture is practised under incomplete information. The skill is making sound decisions in spite of it.
Module 01 — Solution Design
When a stakeholder brings you a problem, your first job is not to produce a diagram. Your first job is to understand the problem well enough that you could explain it back more clearly than they explained it to you.
Architectural Styles
An architectural style is a named, well-understood approach to organising a system. It is a starting point, not a solution.
Layered (N-Tier) Architecture
The system is divided into horizontal layers, each with a specific responsibility. Each layer only communicates with the layer directly below it.
Works well: Line-of-business applications with clear CRUD operations, small-to-medium teams, well-understood requirements upfront.
Breaks down: At scale, layers become monolithic slabs. Business logic leaks into the presentation layer. This is the big ball of mud anti-pattern.
Warning — The Sinkhole Anti-Pattern: If more than 20% of your requests pass straight through every layer without transformation, reconsider whether you need all the layers you have defined.
Microservices Architecture
The system is divided into small, independently deployable services, each owning its own data and communicating via well-defined APIs.
Works well: Large organisations where multiple teams need to deploy independently. Systems where components have very different scaling needs.
Breaks down: Small teams where operational overhead outweighs the independence gained. Microservices sharing a database are not microservices — they are a distributed monolith, which is the worst of both worlds.
Event-Driven Architecture
Components communicate by producing and consuming events through a message broker. Producers do not know who consumes their events.
Works well: Workflows where multiple things must happen in response to one action. Systems that need to absorb bursty load. Audit logging where every event is a record of what happened.
Breaks down: Systems requiring strong consistency. Debugging is significantly harder than in synchronous systems.
Hexagonal Architecture (Ports and Adapters)
The application core (domain logic) sits at the centre, completely ignorant of the outside world. It defines ports (interfaces). Adapters translate between the outside world and those interfaces.
A port is an interface defined by the application core, such as IUserRepository. An adapter is a concrete implementation of that interface, such as PostgresUserRepository. The core defines what it needs; adapters fulfil those needs without the core knowing how.
The C4 Model
Created by Simon Brown, the C4 model provides four levels of zoom for communicating architecture. The most common failure is altitude mismatch: explaining infrastructure details to a CEO, or system context to a developer who needs to know which class to modify.
Design Principles
Coupling and Cohesion: Coupling is the degree to which one component depends on another. Cohesion is the degree to which elements within a component belong together. The goal is always high cohesion, low coupling.
The Dependency Rule: Source code dependencies must always point inward — from outer layers (frameworks, databases, UI) toward inner layers (use cases, entities). The business rules at the centre must know nothing about the database, the web framework, or the UI library.
Decision-Making Framework:
- Frame the problem. What constraint forces this decision? What business goals must it serve?
- List the options, including the option of doing nothing. Evaluate at minimum three options.
- Identify the trade-offs. What does each option gain? What does it give up?
- Make the decision explicit. Name the decision, the chosen option, and the rationale.
- Document the context. Future architects need to know why this was correct at the time.
Case Study — Uber's Architecture Evolution
In 2014, Uber was a monolithic Python application using the Flask framework, backed by a single PostgreSQL database. As it expanded to new cities, the monolith became a bottleneck. Deployments broke unrelated features. A bug in the payments module could take down the driver location service.
Uber began migrating to microservices using the Strangler Fig pattern — new services were built alongside the monolith, and traffic was gradually routed to them. Trip management, user accounts, payments, and driver dispatch were extracted first because they had distinct data models and distinct teams.
Uber's microservices eventually numbered in the thousands. Operational overhead became significant. They later consolidated services and invested heavily in internal tooling. Microservices solve one class of problems while creating another.
Practice Exercise (30 minutes): Choose a system you have worked on. Produce two diagrams on paper: a C4 Level 1 (System Context) and a C4 Level 2 (Container). Then write one paragraph answering: "Why did I choose this architectural style, and what would change if user volume increased 100x?"
Module 02 — Non-Functional Requirements
A system that does what it is supposed to do but crashes under load, leaks data, or responds in 30 seconds is not a working system — it is a liability. NFRs define the operational qualities that make a system fit for production. They are invisible until violated.
NFR Categories
| Category | What It Means | Example Metric |
|---|---|---|
| Performance | Response time and throughput under load | API response < 200ms at p95 under 1,000 RPS |
| Scalability | Ability to handle growth in users, data, transactions | Handle 10x current load within 6 months |
| Availability | Uptime and resilience to failure | 99.9% uptime = max ~8.77 hours downtime per year |
| Reliability | Correctness and consistency of behaviour over time | Zero data loss on payment transactions |
| Security | Protection from unauthorised access or data breach | All PII encrypted at rest and in transit (AES-256) |
| Maintainability | Ease of modifying or extending the system | A new developer can make a code change within 1 day |
| Observability | Ability to understand system state from its outputs | Full distributed tracing across all services |
| Disaster Recovery | How fast you recover from a major failure | RTO < 4 hours, RPO < 1 hour |
Note: Always express performance NFRs in percentiles — p50, p95, p99 — never averages. An average of 100ms can hide the fact that 1% of requests take 30 seconds.
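A small numeric sketch makes the point concrete. The latency figures below are illustrative, and the percentile function uses the simple nearest-rank method:

```python
import math

# 980 requests at 50ms and 20 pathological requests at 5s (illustrative numbers).
latencies = [50] * 980 + [5000] * 20

def percentile(values, p):
    # Nearest-rank percentile: the smallest value such that at least
    # p% of the data is at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean = sum(latencies) / len(latencies)
print(f"mean = {mean:.0f}ms")                   # 149ms — looks tolerable
print(f"p50  = {percentile(latencies, 50)}ms")  # 50ms
print(f"p95  = {percentile(latencies, 95)}ms")  # 50ms
print(f"p99  = {percentile(latencies, 99)}ms")  # 5000ms — the tail the mean hid
```

A dashboard showing only the 149ms mean would pass a "under 200ms" NFR while 2% of users wait five seconds.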
Availability Numbers
| Availability | Downtime per Year | Downtime per Month | Typical Use Case |
|---|---|---|---|
| 99% (two nines) | ~87.6 hours | ~7.3 hours | Internal tools, dev environments |
| 99.9% (three nines) | ~8.77 hours | ~43.8 minutes | Most business applications |
| 99.95% | ~4.38 hours | ~21.9 minutes | Consumer-facing SaaS products |
| 99.99% (four nines) | ~52.6 minutes | ~4.4 minutes | Financial systems, healthcare |
| 99.999% (five nines) | ~5.26 minutes | ~26.3 seconds | Telecommunications, life-critical |
The cost of achieving each additional nine increases exponentially. Always ask: "What is the business cost of each additional nine, and is it worth the engineering cost?"
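The downtime budgets in the table fall out of simple arithmetic; a quick sketch (using a 365.25-day year, consistent with the table's figures):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_per_year_minutes(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year_minutes(nines):.1f} min/year")
# 99.0%   -> ~5259.6 min (~87.7 hours)
# 99.9%   -> ~526.0 min  (~8.8 hours)
# 99.99%  -> ~52.6 min
# 99.999% -> ~5.3 min
```

Each extra nine divides the error budget by ten while the engineering cost of meeting it grows far faster than tenfold.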
RTO and RPO
RTO — Recovery Time Objective: The maximum acceptable length of time that the system can be offline after a failure before it causes unacceptable business damage. Drives your recovery automation, failover speed, and on-call staffing.
RPO — Recovery Point Objective: The maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you must back up or replicate data at least every hour. An RPO of zero requires synchronous replication to a standby.
Writing SMART NFRs
Every NFR must be Specific, Measurable, Achievable, Relevant, and Time-bound. Vague NFRs cannot be tested, enforced, or used to drive architectural decisions.
| Vague (Useless) | SMART (Useful) |
|---|---|
| The system should be fast | The product search API must return results in under 300ms at the 95th percentile under a sustained load of 500 concurrent users with production-representative data. |
| The system must be secure | All user passwords must be hashed using bcrypt with a cost factor of minimum 12. No PII may appear in application logs. All API tokens must expire within 24 hours. |
| The system should scale | The system must maintain p95 response times below 500ms when concurrency increases from 100 to 10,000 users via horizontal scaling with no code changes required. |
| The system should be available | The payment service must achieve 99.95% uptime measured monthly, excluding planned maintenance windows communicated 72 hours in advance, not exceeding 2 hours per month. |
Case Study — Netflix and Chaos Engineering
In 2010, Netflix began migrating from data centres to AWS. Their engineers realised they could not guarantee reliability by preventing failures — at cloud scale, individual components would fail constantly. The question was not whether failures would occur but whether the system would survive them.
Rather than writing traditional availability NFRs, Netflix defined a resilience requirement: the system must continue serving user-facing features gracefully even when individual backend instances are randomly terminated during business hours. They built Chaos Monkey in 2011 to enforce this requirement in production. Netflix released Chaos Monkey as open source in 2012 under the Apache 2.0 licence, expanding it into the Simian Army.
This single resilience NFR drove mandatory architectural decisions: every service needed circuit breakers, all downstream calls needed timeouts and fallbacks, every feature needed a degraded mode, and all services needed to be stateless so any instance could be terminated without data loss.
Practice Exercise (25 minutes): Choose any application you have worked on. Write 6 NFRs covering: performance, scalability, availability, security, observability, and one of your choice. For each, write it first in vague form then rewrite it in SMART form. Identify which two NFRs are in tension with each other.
Module 03 — Governance & Architecture Decision Records
Without governance, every team makes different decisions, the codebase fragments, and institutional knowledge lives only in people's heads. Governance is how you scale architectural consistency across teams without becoming a bottleneck.
Architecture Decision Records (ADRs)
An ADR is a short document — one to two pages maximum — that captures a single architectural decision, the context that forced it, the options considered, and the rationale for the choice made.
| ADR Section | What to Write | Common Mistake |
|---|---|---|
| Title | Short, imperative, specific. "Use PostgreSQL for all transactional data." | Vague titles that do not communicate the decision |
| Status | Proposed / Accepted / Deprecated / Superseded (by ADR-023) | Never updating status when decisions change |
| Context | What forces are at play? What constraints exist? Write as if explaining to someone not in the room. | Writing context as justification for a decision already made |
| Decision | The specific choice, active voice. "We will use PostgreSQL 15 with read replicas." | Describing the process of deciding rather than the decision |
| Consequences | What becomes easier? What becomes harder? What risks does this introduce? | Only listing positive consequences |
| Alternatives Considered | Other options evaluated and why they were rejected | Omitting this section entirely |
Note: The "Alternatives Considered" section is the most valuable part for future readers. Without it, a new engineer cannot know that an option was already evaluated and rejected. They may spend weeks investigating something already dismissed.
Architecture Review Boards
An ARB is a governance forum where significant architectural decisions are reviewed before implementation.
ARB trigger criteria: New technology adoption, cross-team architectural changes, changes to shared infrastructure, decisions with significant security or cost implications, and any decision where the cost of being wrong is high.
Case Study — Amazon's API Mandate
In 2002, Amazon was struggling with a codebase so deeply interconnected that teams could not work independently. Jeff Bezos issued the "API Mandate": all teams must expose their data and functionality through service interfaces; teams must communicate only through these interfaces, with no shared databases and no direct linking; all interfaces must be designed to be externally exposable.
This was a governance decision, not just an architectural one. It changed how teams were evaluated and how data was accessed. The mandate created the foundation for what eventually became Amazon Web Services — APIs built for internal use turned out to be sellable externally because they were designed with strict interface contracts from the beginning.
Practice Exercise (30 minutes): Write a complete ADR for a real architectural decision from your current or most recent project. Write all six sections. Write at least two rejected alternatives with reasons. Then answer: "If someone reads this ADR in three years, what context might they be missing?"
Module 04 — Cloud Readiness
There is a critical distinction between an application that runs in the cloud and one that is cloud-native. Running in the cloud means you have moved your virtual machines from a data centre to AWS. Cloud-native means your application is designed to exploit cloud capabilities: elastic scaling, managed services, pay-per-use cost models, and geographic distribution.
The 6 Rs of Cloud Migration
| Strategy | Description | Effort | Cloud Benefit |
|---|---|---|---|
| Rehost (Lift & Shift) | Move as-is to cloud VMs, no code changes | Low | Minimal |
| Replatform (Lift & Reshape) | Small optimisations, e.g. move to managed RDS | Low–Medium | Moderate |
| Repurchase | Replace entirely with a SaaS product | Medium | High |
| Refactor / Re-architect | Redesign as cloud-native | High | Very High |
| Retire | Decommission — application no longer needed | Low | N/A |
| Retain | Keep on-premise — regulatory or risk reasons | None | None |
The 12-Factor App
Authored by Adam Wiggins (Heroku co-founder) and published in 2011. The four most architecturally significant factors:
| Factor | What It Means | Architectural Impact |
|---|---|---|
| III. Config in environment | No hardcoded config. Use environment variables or a config service. | Enables deploying the same artefact to dev, staging, and prod |
| VI. Stateless processes | No persistent data in memory between requests. State lives in backing services. | Makes horizontal scaling possible — any instance can serve any request |
| IX. Disposability | Processes start fast and shut down gracefully. | Enables auto-scaling and zero-downtime deployments |
| XI. Logs as event streams | Applications write logs to stdout; infrastructure handles aggregation. | Application has no knowledge of log destination — deployable anywhere |
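Factor III is the easiest of the four to show in code. A minimal sketch, where the variable names (DATABASE_URL, LOG_LEVEL) are illustrative conventions rather than any standard:

```python
import os

# Factor III in practice: the same build artefact runs in dev, staging,
# and prod because configuration comes from the environment, not the code.
def load_config() -> dict:
    return {
        "database_url": os.environ.get("DATABASE_URL", "postgres://localhost/dev"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }

os.environ["LOG_LEVEL"] = "DEBUG"  # simulate what a staging deployment would set
print(load_config()["log_level"])  # DEBUG
```

The anti-pattern this replaces is a `config.staging.yml` committed to version control, which forces a rebuild (or at least a redeploy of different artefacts) per environment.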
Cloud-Native Resilience Patterns
Circuit Breaker
A circuit breaker wraps calls to a downstream dependency and counts failures. When failures cross a threshold, the breaker "opens" and subsequent calls fail fast instead of waiting on a dying service; after a cooldown it permits a trial call and closes again if that call succeeds. This contains the failure and prevents the cascades a slow or dead dependency would otherwise cause.
Other Key Patterns
- Bulkhead: Isolate failures by partitioning resources (thread pools, connection pools) so a failure in one area cannot consume all shared resources.
- Retry with Exponential Backoff: Retry transient failures with increasing wait times plus jitter to avoid thundering herd problems.
- Sidecar: Attach a helper process alongside each service for cross-cutting concerns (logging, tracing, auth, TLS termination).
- Blue/Green Deployment: Run two identical environments; route traffic to the new one, roll back instantly if needed.
- Health Endpoint: Every service exposes a /health endpoint so orchestrators can detect and replace failed instances.
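Retry with exponential backoff is the one pattern in the list above that is routinely implemented by hand, and routinely implemented without the jitter that makes it safe. A minimal sketch using "full jitter" (a random wait drawn from zero up to the exponential cap):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a transient failure with exponentially growing waits plus
    full jitter, so many retrying clients do not stampede in lockstep."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            cap = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, cap))  # the jitter
```

Without jitter, every client that failed at the same moment retries at the same moment, recreating the spike that caused the failure: the thundering herd the bullet above warns about.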
Case Study — Airbnb's Migration from Monolith
By 2017, Airbnb's Rails monolith had grown to millions of lines of code. Build times exceeded 45 minutes. Before migrating, engineers assessed the monolith against cloud-native principles and found critical violations: session state in server memory (violating Factor VI), configuration embedded in YAML files committed to version control (violating Factor III), startup time of 8 minutes (violating Factor IX), and logs written to local disk (violating Factor XI).
Airbnb addressed the 12-Factor violations first (sessions moved to Redis, config moved to Consul), then began extracting services at natural domain boundaries. The search service was extracted first because it had the clearest data boundary and the highest independent scaling need.
Practice Exercise (30 minutes): Choose a system you know. Apply the 6 Rs. Check the system against the four most important 12-Factor principles. Where does it fall short? Identify the single most important cloud-native pattern it is missing. Estimate the migration effort using T-shirt sizing.
Module 05 — Estimation
The goal of estimation is not accuracy. The goal is calibrated confidence — communicating what you know, what you do not know, and how confident you are, in terms a business stakeholder can act on.
T-Shirt Sizing
| Size | Approximate Effort | Characteristics | Examples |
|---|---|---|---|
| XS | 1–3 days | Well-understood, single developer, no dependencies | Adding a field to an API response |
| S | 1–2 weeks | Clear scope, single team, some implementation unknowns | New API endpoint; integration with a known SDK |
| M | 2–6 weeks | Multiple components, cross-team coordination, meaningful unknowns | New microservice extraction; auth system replacement |
| L | 2–4 months | Significant complexity, multiple teams, unclear scope areas | Cloud migration of an application; new event-driven pipeline |
| XL | 4+ months | High complexity, many unknowns — break into phases | Full microservices migration; new product platform |
Note: An XL estimate is a signal, not a commitment. The right response is: "How do we break this into deliverable phases where each phase provides business value?"
Three-Point Estimation (PERT)
Program Evaluation and Review Technique gives a statistically defensible expected value while explicitly modelling uncertainty.
- Optimistic (O): Everything goes right. No unexpected problems.
- Most Likely (M): Realistic case with normal friction and rework.
- Pessimistic (P): Significant problems occur. Dependencies are late. Complexity was underestimated.
Expected = (O + 4M + P) / 6
Std Dev = (P - O) / 6
Worked Example — OAuth 2.0 Migration:
| Scenario | Weeks | Reasoning |
|---|---|---|
| Optimistic (O) | 4 | Library works first time, no legacy edge cases, team has done this before |
| Most Likely (M) | 7 | 2 weeks discovery, 3 weeks implementation, 2 weeks testing and fixes |
| Pessimistic (P) | 14 | Legacy session handling deeply embedded, clients need coordinated migration, security review required |
Expected = (4 + 4×7 + 14) / 6 = 46 / 6 ≈ 7.7 weeks
Std Dev = (14 - 4) / 6 ≈ 1.7 weeks
68% confidence interval: ≈ 6.0 to 9.3 weeks
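The arithmetic above is mechanical enough to capture in a few lines, which is useful when running PERT across many workstreams:

```python
def pert(o: float, m: float, p: float) -> tuple[float, float]:
    """PERT expected value and standard deviation from O/M/P estimates."""
    expected = (o + 4 * m + p) / 6
    std_dev = (p - o) / 6
    return expected, std_dev

# The OAuth 2.0 migration example: O=4, M=7, P=14 weeks.
expected, sd = pert(4, 7, 14)
print(f"expected ≈ {expected:.1f} weeks, σ ≈ {sd:.1f} weeks")
print(f"68% interval: {expected - sd:.1f} to {expected + sd:.1f} weeks")
```

Reporting the interval rather than the single expected value is the whole point: it communicates calibrated confidence, not false precision.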
Cone of Uncertainty
The earlier in a project you estimate, the wider the range of error. This is not a failure of skill — it is a mathematical reality. Estimation error decreases as a project progresses and more information is known.
Warning — The Commitment Trap: The most dangerous moment in estimation is when a stakeholder receives a wide early-phase estimate and asks you to "just give a number for planning purposes." The number you give becomes the plan. Caveats evaporate. "I cannot give you a number more accurate than X until we complete the discovery phase" is one of the most important sentences a Technical Architect can say.
Case Study — Healthcare.gov Launch Failure
Healthcare.gov, the US federal health insurance marketplace, launched in October 2013 and immediately failed. On its first day, only 6 people successfully enrolled despite 250,000 visitors. Load estimates were based on optimistic scenarios. The database was not load-tested until two weeks before launch. Integration testing of the 55 separate agencies' systems was not completed until the final weeks.
The most critical failure was in integration complexity estimation. Integration complexity does not scale linearly: two systems have one integration point, three systems have three, ten systems have 45 possible interaction points. This was not adequately accounted for.
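The quadratic growth described above is the pairwise combination count n(n−1)/2, worth checking against the figures in the text:

```python
def integration_points(n_systems: int) -> int:
    """Possible pairwise interaction points between n systems: n(n-1)/2."""
    return n_systems * (n_systems - 1) // 2

for n in (2, 3, 10, 55):
    print(f"{n} systems -> {integration_points(n)} interaction points")
# 2 -> 1, 3 -> 3, 10 -> 45, 55 -> 1485
```

At Healthcare.gov's scale of 55 systems, that is 1,485 possible interaction points; a linear mental model of integration effort is off by more than an order of magnitude.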
Practice Exercise (25 minutes): Estimate an OAuth 2.0 migration where 40 client applications consume the current LDAP-based system and the team has never implemented OAuth 2.0 before. Document your O/M/P scenarios, top 3 assumptions, top 3 risks, and what a 2-week discovery phase should investigate.
Module 06 — GenAI Architecture
Large Language Models are not deterministic systems. Send the same prompt twice and you may receive different responses. Ask them a factual question and they may answer confidently with fabricated information. Integrating LLMs into production systems requires patterns specifically designed to manage non-determinism, latency, cost, and trust.
Retrieval Augmented Generation (RAG)
RAG solves the fundamental problem that LLMs know nothing about your organisation's specific documents, policies, or internal knowledge. RAG retrieves relevant context from your own data store and provides it to the LLM as part of the prompt.
Query-Time Flow
At query time, the user's question is converted to an embedding, the vector store is searched for the most semantically similar document chunks, and the top results are assembled into the prompt alongside the question before it is sent to the LLM. The model answers from the supplied context rather than from its training data alone.
Ingestion Pipeline (Offline)
Ahead of time, source documents are split into chunks, each chunk is converted to an embedding, and the embeddings are stored in a vector database alongside the original text, so that query-time retrieval is a fast similarity search rather than a scan of the corpus.
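Both halves of RAG can be sketched end to end. This is a toy: the "embedding" below is a bag-of-words vector so the example is self-contained, whereas a real pipeline would call a learned embedding model and a real vector database; the document chunks are invented.

```python
import math
import re
from collections import Counter

# Toy 'embedding': a bag-of-words vector (stand-in for a learned model).
def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ingestion (offline): chunk documents and index their embeddings.
chunks = [
    "Expense claims must be submitted within 30 days",
    "Annual leave requests require manager approval",
    "Laptops are refreshed on a three year cycle",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query time: embed the question, retrieve the most similar chunks,
# and assemble them into the prompt sent to the LLM.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

context = retrieve("When must expense claims be submitted?")
prompt = "Answer using only this context:\n" + "\n".join(context)
print(context[0])  # the expense-claims chunk ranks first
```

The retrieval step is where RAG quality is won or lost: chunking strategy, embedding model, and the number of chunks retrieved all matter more than the final prompt wording.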
Agents and Tool Use
Pure RAG is a read-only pattern. Agents extend this by allowing the LLM to call functions, query APIs, execute code, and take real-world actions. This dramatically increases capability and risk simultaneously.
Warning: Design every agentic system with explicit action allowlists, confirmation steps before irreversible actions, rate limiting per user and per action type, and comprehensive logging. A malicious user could inject instructions through a document the agent reads (prompt injection), overriding your system prompt.
Risks and Mitigations
| Risk | Description | Architectural Mitigation |
|---|---|---|
| Hallucination | LLM generates plausible-sounding but factually incorrect information | Use RAG to ground responses in verified documents. Build evaluation harnesses with golden datasets. |
| Latency | LLM API calls take 1–30 seconds | Streaming responses, async processing, response caching, smaller models for latency-sensitive paths. |
| Cost | LLMs charge per token; complex prompts are expensive at scale | Token budget management, prompt caching, smaller models for classification tasks, monitor cost per request. |
| Data Privacy | User data sent to third-party LLM APIs may violate GDPR or HIPAA | PII detection and redaction before sending to external APIs. Audit logs of what data was sent. |
| Prompt Injection | Malicious users embed instructions that override system prompts | Input sanitisation, privilege separation, output validation, anomaly monitoring. |
| Model Version Drift | Providers update models without notice, changing behaviour | Pin to specific model versions. Maintain regression test suites. |
| Non-Determinism | Same prompt produces different outputs | Set temperature to 0 for deterministic tasks. Build probabilistic evaluation harnesses. |
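The cost risk in the table above is easiest to reason about with a back-of-envelope model. The per-token prices below are illustrative placeholders, not any provider's actual rates:

```python
# Assumed, illustrative prices — substitute your provider's real rates.
PRICE_PER_1K_INPUT_TOKENS = 0.003
PRICE_PER_1K_OUTPUT_TOKENS = 0.015

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS)

# A RAG prompt carrying 3,000 tokens of retrieved context, 500-token answer:
per_request = cost_per_request(3000, 500)
monthly = per_request * 100_000  # at 100k requests/month
print(f"${per_request:.4f} per request, ${monthly:,.0f} per month")
```

Note how the retrieved context dominates the input side: trimming retrieval from five chunks to two is often the single cheapest cost optimisation available.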
Case Study — GitHub Copilot
Code suggestions that arrive more than 200ms after the user pauses typing feel unresponsive. Copilot uses a purpose-trained model rather than a general-purpose model — a smaller, faster model achieves quality competitive with larger models on the specific task of code completion. The prompt is carefully engineered to include only the most relevant context (surrounding code, related files, open tabs), keeping token counts, and therefore latency and cost, bounded.
GitHub cannot test Copilot by checking if generated code is "correct" — correctness is domain-specific. Instead they evaluate on proxy metrics: acceptance rate (did the developer accept the suggestion?), persistence rate (is the accepted suggestion still in the code 30 seconds later?), and latency distribution (what percentage of suggestions arrive within 200ms?).
Practice Exercise (35 minutes): Design a RAG-based internal knowledge assistant for a 5,000-person company. Produce: a C4 Level 2 container diagram; the top 5 NFRs with SMART metrics; the top 3 risks with specific mitigations; an explanation of how you would handle document freshness; and a cost model.
Artifact Templates
Architecture Proposal Template
| Section | Content | Length |
|---|---|---|
| 1. Executive Summary | Problem, proposed solution, key trade-offs | 1 paragraph |
| 2. Problem Statement | Business problem, who is affected, constraints | 1–2 paragraphs |
| 3. Assumptions & Constraints | All assumptions listed explicitly | Bulleted list |
| 4. Proposed Architecture | C4 Level 1 + Level 2 diagrams, one paragraph per major component | Diagrams + 1 page |
| 5. Key Decisions | 3–5 decisions in mini-ADR format | ½ page per decision |
| 6. NFR Coverage | Top 5 NFRs with SMART metrics and how the architecture addresses each | Table |
| 7. Risks & Mitigations | Top 3–5 risks with likelihood, impact (H/M/L), and specific mitigation | Table |
| 8. Open Questions | What would you need to know to finalise this design? | Bulleted list |
NFR Specification Template
| Attribute | Content |
|---|---|
| Category | Performance / Scalability / Availability / Security / Observability / etc. |
| Statement | The specific, measurable requirement |
| Measurement Method | How this will be verified — load test, security scan, uptime monitor |
| Acceptance Threshold | Pass/fail criteria with specific numbers |
| Priority | Must Have / Should Have / Nice to Have |
| Architectural Impact | Which architectural decisions this NFR drives |
Glossary
| Term | Definition | Module |
|---|---|---|
| ADR | Architecture Decision Record — a document capturing a single architectural decision, its context, alternatives, and consequences | 03 |
| ARB | Architecture Review Board — a governance forum for reviewing significant architectural decisions | 03 |
| Availability | The proportion of time a system is operational, expressed as a percentage | 02 |
| Bounded Context | A logical boundary within which a domain model is internally consistent | 01 |
| C4 Model | A hierarchical diagramming approach with four levels: Context, Container, Component, Code | 01 |
| Circuit Breaker | A pattern that prevents cascade failures by stopping calls to a failing service after a threshold | 04 |
| Cohesion | The degree to which elements within a component belong together. High cohesion is desirable. | 01 |
| Coupling | The degree of dependency between components. Low coupling is desirable. | 01 |
| Embedding | A numerical vector representation of text capturing semantic meaning | 06 |
| Hallucination | An LLM generating plausible-sounding but factually incorrect information | 06 |
| NFR | Non-Functional Requirement — a requirement specifying how well a system performs its functions | 02 |
| PERT | Program Evaluation and Review Technique — three-point estimation using O, M, and P estimates | 05 |
| Prompt Injection | An attack where malicious input overrides an LLM system prompt | 06 |
| RAG | Retrieval Augmented Generation — grounds LLM responses in retrieved documents | 06 |
| RPO | Recovery Point Objective — the maximum acceptable data loss measured in time | 02 |
| RTO | Recovery Time Objective — the maximum acceptable duration of downtime after a failure | 02 |
| Strangler Fig Pattern | Migrating a legacy system by gradually extracting components over time | 01 |
| T-Shirt Sizing | Relative estimation using XS/S/M/L/XL categories | 05 |
| 12-Factor App | A methodology defining 12 characteristics of applications designed for cloud deployment, authored by Adam Wiggins in 2011 | 04 |
| Vector Database | A database optimised for storing and querying embedding vectors | 06 |