Technical Architect's Field Manual

Indrajith's — Reference Documents — April 2026

A practical guide covering solution design, non-functional requirements, governance, cloud readiness, estimation, and GenAI architecture.


Preface

This book is written for a Technical Architect who needs to be prepared, not just informed. Being informed means you have read about architectural concepts. Being prepared means you can walk into a room, face a real problem, and produce a structured, defensible approach under pressure.

Every module follows the same structure: why it matters, the core concepts explained plainly, real-world case studies, and practice exercises. Read a module, do the exercise, then move on. Do not read this cover to cover in one sitting.

Note: This book references Robert C. Martin's Clean Architecture throughout relevant modules. These references are not required reading. This book stands on its own.

What Makes a Good Technical Architect?


Module 01 — Solution Design

When a stakeholder brings you a problem, your first job is not to produce a diagram. Your first job is to understand the problem well enough that you could explain it back more clearly than they explained it to you.

Architectural Styles

An architectural style is a named, well-understood approach to organising a system. It is a starting point, not a solution.

Layered (N-Tier) Architecture

The system is divided into horizontal layers, each with a specific responsibility. Each layer only communicates with the layer directly below it.

```mermaid
flowchart TD
    A["Presentation Layer — UI, API Controllers"]
    B["Business Layer — Business rules, use cases"]
    C["Persistence Layer — Repositories, data access"]
    D["Database Layer — SQL / NoSQL storage"]
    A --> B --> C --> D
```

Works well: Line-of-business applications with clear CRUD operations, small-to-medium teams, well-understood requirements upfront.

Breaks down: At scale, layers become monolithic slabs. Business logic leaks into the presentation layer. This is the big ball of mud anti-pattern.

Warning — The Sinkhole Anti-Pattern: If more than 20% of your requests pass straight through every layer without transformation, reconsider whether you need all the layers you have defined.

Microservices Architecture

The system is divided into small, independently deployable services, each owning its own data and communicating via well-defined APIs.

```mermaid
flowchart LR
    Client --> GW["API Gateway\nrouting, auth, rate limiting"]
    GW --> US["User Service"]
    GW --> OS["Order Service"]
    GW --> PS["Payment Service"]
    US --- DB1[("Users DB")]
    OS --- DB2[("Orders DB")]
    PS --- DB3[("Payments DB")]
```

Works well: Large organisations where multiple teams need to deploy independently. Systems where components have very different scaling needs.

Breaks down: Small teams where operational overhead outweighs the independence gained. Microservices sharing a database are not microservices — they are a distributed monolith, which is the worst of both worlds.

Event-Driven Architecture

Components communicate by producing and consuming events through a message broker. Producers do not know who consumes their events.

```mermaid
flowchart LR
    OS["Order Service"] -- "OrderPlaced event" --> MB["Message Broker\nKafka / RabbitMQ"]
    MB --> ES["Email Service"]
    MB --> IS["Inventory Service"]
    MB --> AS["Analytics Service"]
```

Works well: Workflows where multiple things must happen in response to one action. Systems that need to absorb bursty load. Audit logging where every event is a record of what happened.

Breaks down: Systems requiring strong consistency. Debugging is significantly harder than in synchronous systems.
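The decoupling described above can be sketched with a toy in-process broker in Python. This is illustrative only: the class, topic, and event names are invented, and a real deployment would use Kafka or RabbitMQ, which add the persistence, ordering, and delivery guarantees this sketch ignores.

```python
from collections import defaultdict
from typing import Callable

# Toy in-process broker. The key property of the pattern is visible here:
# publish() iterates over whoever subscribed, and the producer never names them.
class Broker:
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer has no idea who (if anyone) is listening.
        for handler in self._subscribers[topic]:
            handler(event)

broker = Broker()
received = []
broker.subscribe("OrderPlaced", lambda e: received.append(("email", e["order_id"])))
broker.subscribe("OrderPlaced", lambda e: received.append(("inventory", e["order_id"])))
broker.publish("OrderPlaced", {"order_id": "ord-42"})
# Each subscriber sees the same event; the Order Service published it exactly once.
```

Adding the Analytics Service from the diagram is one more `subscribe` call; the Order Service's code does not change, which is the whole point of the pattern.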

Hexagonal Architecture (Ports and Adapters)

The application core (domain logic) sits at the centre, completely ignorant of the outside world. It defines ports (interfaces). Adapters translate between the outside world and those interfaces.

```mermaid
flowchart LR
    subgraph "External Drivers"
        REST & CLI & GRPC
    end
    subgraph "Application Core"
        IP["Input Port (interface)"] --> LOGIC["Domain Logic"]
        LOGIC --> OP["Output Port (interface)"]
    end
    subgraph "External Driven"
        DB["DB Adapter"] & MQ["MQ Adapter"]
    end
    REST & CLI & GRPC --> IP
    OP --> DB & MQ
```

A port is an interface defined by the application core, such as IUserRepository. An adapter is a concrete implementation of that interface, such as PostgresUserRepository. The core defines what it needs; adapters fulfil those needs without the core knowing how.
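A minimal Python sketch of the idea, using a `Protocol` as the port. The names here are illustrative stand-ins for the IUserRepository / PostgresUserRepository example above; an in-memory adapter is used so the sketch is self-contained.

```python
from typing import Optional, Protocol

class User:
    def __init__(self, user_id: str, email: str) -> None:
        self.user_id = user_id
        self.email = email

# Port: an interface owned by the application core.
class UserRepository(Protocol):
    def find_by_id(self, user_id: str) -> Optional[User]: ...
    def save(self, user: User) -> None: ...

# Adapter: a concrete implementation outside the core. In-memory here so the
# sketch runs; a Postgres-backed class would satisfy the same Protocol.
class InMemoryUserRepository:
    def __init__(self) -> None:
        self._users: dict[str, User] = {}

    def find_by_id(self, user_id: str) -> Optional[User]:
        return self._users.get(user_id)

    def save(self, user: User) -> None:
        self._users[user.user_id] = user

# The use case depends only on the port, never on a concrete adapter.
def register_user(repo: UserRepository, user_id: str, email: str) -> User:
    if repo.find_by_id(user_id) is not None:
        raise ValueError(f"user {user_id} already exists")
    user = User(user_id, email)
    repo.save(user)
    return user
```

Swapping the database means writing a new adapter; `register_user` and everything else in the core stays untouched, which is exactly the dependency direction the Dependency Rule demands.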

The C4 Model

Created by Simon Brown, the C4 model provides four levels of zoom for communicating architecture. The most common failure is altitude mismatch: explaining infrastructure details to a CEO, or system context to a developer who needs to know which class to modify.

```mermaid
flowchart TD
    L1["Level 1 — System Context\nAudience: Everyone including non-technical\nShows: Your system, external users, external systems"]
    L2["Level 2 — Container\nAudience: Technical stakeholders, architects\nShows: Web app, API, database, message queue"]
    L3["Level 3 — Component\nAudience: Developers on that container\nShows: Major components inside one container"]
    L4["Level 4 — Code\nAudience: Individual developers\nShows: Classes, functions, interfaces"]
    L1 --> L2 --> L3 --> L4
```

Design Principles

Coupling and Cohesion: Coupling is the degree to which one component depends on another. Cohesion is the degree to which elements within a component belong together. The goal is always high cohesion, low coupling.

The Dependency Rule: Source code dependencies must always point inward — from outer layers (frameworks, databases, UI) toward inner layers (use cases, entities). The business rules at the centre must know nothing about the database, the web framework, or the UI library.

Decision-Making Framework:

  1. Frame the problem. What constraint forces this decision? What business goals must it serve?
  2. List the options, including the option of doing nothing. Evaluate at minimum three options.
  3. Identify the trade-offs. What does each option gain? What does it give up?
  4. Make the decision explicit. Name the decision, the chosen option, and the rationale.
  5. Document the context. Future architects need to know why this was correct at the time.

Case Study — Uber's Architecture Evolution

In 2014, Uber was a monolithic Python application using the Flask framework, backed by a single PostgreSQL database. As it expanded to new cities, the monolith became a bottleneck. Deployments broke unrelated features. A bug in the payments module could take down the driver location service.

Uber began migrating to microservices using the Strangler Fig pattern — new services were built alongside the monolith, and traffic was gradually routed to them. Trip management, user accounts, payments, and driver dispatch were extracted first because they had distinct data models and distinct teams.

Uber's microservices eventually numbered in the thousands. Operational overhead became significant. They later consolidated services and invested heavily in internal tooling. Microservices solve one class of problems while creating another.

Practice Exercise (30 minutes): Choose a system you have worked on. Produce two diagrams on paper: a C4 Level 1 (System Context) and a C4 Level 2 (Container). Then write one paragraph answering: "Why did I choose this architectural style, and what would change if user volume increased 100x?"

Module 02 — Non-Functional Requirements

A system that does what it is supposed to do but crashes under load, leaks data, or responds in 30 seconds is not a working system — it is a liability. NFRs define the operational qualities that make a system fit for production. They are invisible until violated.

NFR Categories

| Category | What It Means | Example Metric |
| --- | --- | --- |
| Performance | Response time and throughput under load | API response < 200ms at p95 under 1,000 RPS |
| Scalability | Ability to handle growth in users, data, transactions | Handle 10x current load within 6 months |
| Availability | Uptime and resilience to failure | 99.9% uptime = max 8.77 hours downtime per year |
| Reliability | Correctness and consistency of behaviour over time | Zero data loss on payment transactions |
| Security | Protection from unauthorised access or data breach | All PII encrypted at rest and in transit (AES-256) |
| Maintainability | Ease of modifying or extending the system | A new developer can make a code change within 1 day |
| Observability | Ability to understand system state from its outputs | Full distributed tracing across all services |
| Disaster Recovery | How fast you recover from a major failure | RTO < 4 hours, RPO < 1 hour |

Note: Always express performance NFRs in percentiles — p50, p95, p99 — never averages. An average of 100ms can hide the fact that 1% of requests take 30 seconds.
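The note above can be made concrete with a nearest-rank percentile over a synthetic latency sample (the numbers are invented for illustration): when 1.5% of requests take 30 seconds, the mean barely flinches while the p99 exposes the problem.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    ordered = sorted(samples)
    k = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[k - 1]

# Synthetic sample: 98.5% of requests at 100ms, 1.5% at 30 seconds.
latencies = [100.0] * 985 + [30_000.0] * 15

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
# mean = 548.5 ms, p50 = 100.0 ms, p95 = 100.0 ms, p99 = 30000.0 ms
```

An NFR written against the mean would pass this system; an NFR written against the p99 would correctly fail it.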

Availability Numbers

| Availability | Downtime per Year | Downtime per Month | Typical Use Case |
| --- | --- | --- | --- |
| 99% (two nines) | ~87.6 hours | ~7.3 hours | Internal tools, dev environments |
| 99.9% (three nines) | ~8.77 hours | ~43.8 minutes | Most business applications |
| 99.95% | ~4.38 hours | ~21.9 minutes | Consumer-facing SaaS products |
| 99.99% (four nines) | ~52.6 minutes | ~4.4 minutes | Financial systems, healthcare |
| 99.999% (five nines) | ~5.26 minutes | ~26.3 seconds | Telecommunications, life-critical |
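These figures follow directly from the definition of availability. A small helper reproduces them (assuming an average 365.25-day year; some published tables use 365 days, so figures may differ slightly in rounding):

```python
HOURS_PER_YEAR = 24 * 365.25  # average Gregorian year

def downtime_per_year_hours(availability_pct: float) -> float:
    """Maximum allowed downtime per year for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100.0)

def downtime_per_month_minutes(availability_pct: float) -> float:
    """Maximum allowed downtime per average month, in minutes."""
    return downtime_per_year_hours(availability_pct) * 60 / 12

# Three nines allows roughly 8.77 hours per year;
# four nines allows roughly 52.6 minutes per year.
```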

The cost of achieving each additional nine increases exponentially. Always ask: "What is the business cost of each additional nine, and is it worth the engineering cost?"

RTO and RPO

RTO — Recovery Time Objective: The maximum acceptable length of time that the system can be offline after a failure before it causes unacceptable business damage. Drives your recovery automation, failover speed, and on-call staffing.

RPO — Recovery Point Objective: The maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you must back up or replicate data at least every hour. An RPO of zero requires synchronous replication to a standby.

```mermaid
timeline
    title RTO and RPO on a Failure Timeline
    section Before Failure
        Last backup taken : RPO window begins here
    section Failure Point
        System goes down : RPO = data potentially lost
    section After Failure
        Recovery begins : RTO window begins here
        System restored : RTO = total downtime window
```

Writing SMART NFRs

Every NFR must be Specific, Measurable, Achievable, Relevant, and Time-bound. Vague NFRs cannot be tested, enforced, or used to drive architectural decisions.

| Vague (Useless) | SMART (Useful) |
| --- | --- |
| The system should be fast | The product search API must return results in under 300ms at the 95th percentile under a sustained load of 500 concurrent users with production-representative data. |
| The system must be secure | All user passwords must be hashed using bcrypt with a cost factor of minimum 12. No PII may appear in application logs. All API tokens must expire within 24 hours. |
| The system should scale | The system must maintain p95 response times below 500ms when concurrency increases from 100 to 10,000 users via horizontal scaling with no code changes required. |
| The system should be available | The payment service must achieve 99.95% uptime measured monthly, excluding planned maintenance windows communicated 72 hours in advance, not exceeding 2 hours per month. |

Case Study — Netflix and Chaos Engineering

In 2010, Netflix began migrating from data centres to AWS. Their engineers realised they could not guarantee reliability by preventing failures — at cloud scale, individual components would fail constantly. The question was not whether failures would occur but whether the system would survive them.

Rather than writing traditional availability NFRs, Netflix defined a resilience requirement: the system must continue serving user-facing features gracefully even when individual backend instances are randomly terminated during business hours. They built Chaos Monkey in 2011 to enforce this requirement in production. Netflix released Chaos Monkey as open source in 2012 under the Apache 2.0 licence, expanding it into the Simian Army.

This single resilience NFR drove mandatory architectural decisions: every service needed circuit breakers, all downstream calls needed timeouts and fallbacks, every feature needed a degraded mode, and all services needed to be stateless so any instance could be terminated without data loss.

Practice Exercise (25 minutes): Choose any application you have worked on. Write 6 NFRs covering: performance, scalability, availability, security, observability, and one of your choice. For each, write it first in vague form then rewrite it in SMART form. Identify which two NFRs are in tension with each other.

Module 03 — Governance & Architecture Decision Records

Without governance, every team makes different decisions, the codebase fragments, and institutional knowledge lives only in people's heads. Governance is how you scale architectural consistency across teams without becoming a bottleneck.

Architecture Decision Records (ADRs)

An ADR is a short document — one to two pages maximum — that captures a single architectural decision, the context that forced it, the options considered, and the rationale for the choice made.

```mermaid
stateDiagram-v2
    [*] --> Proposed : New decision identified
    Proposed --> Accepted : Review complete
    Proposed --> Rejected : Alternative chosen
    Accepted --> Deprecated : Context has changed
    Accepted --> Superseded : Replaced by newer ADR
    Deprecated --> [*]
    Superseded --> [*]
    Rejected --> [*]
```

| ADR Section | What to Write | Common Mistake |
| --- | --- | --- |
| Title | Short, imperative, specific. "Use PostgreSQL for all transactional data." | Vague titles that do not communicate the decision |
| Status | Proposed / Accepted / Deprecated / Superseded (by ADR-023) | Never updating status when decisions change |
| Context | What forces are at play? What constraints exist? Write as if explaining to someone not in the room. | Writing context as justification for a decision already made |
| Decision | The specific choice, active voice. "We will use PostgreSQL 15 with read replicas." | Describing the process of deciding rather than the decision |
| Consequences | What becomes easier? What becomes harder? What risks does this introduce? | Only listing positive consequences |
| Alternatives Considered | Other options evaluated and why they were rejected | Omitting this section entirely |

Note: The "Alternatives Considered" section is the most valuable part for future readers. Without it, a new engineer cannot know that an option was already evaluated and rejected. They may spend weeks investigating something already dismissed.

Architecture Review Boards

An ARB is a governance forum where significant architectural decisions are reviewed before implementation.

```mermaid
flowchart TD
    A["Architect proposes decision"] --> B{"Meets trigger criteria?"}
    B -- "No (routine change)" --> C["Proceed with ADR only"]
    B -- "Yes" --> D["Submit to ARB"]
    D --> E["ARB Review — Presentation and Q&A"]
    E --> F{"ARB Decision"}
    F -- "Approved" --> G["Proceed — ADR marked Accepted"]
    F -- "Conditions" --> H["Address conditions then proceed"]
    F -- "Rejected" --> I["Document rejection, explore alternatives"]
```

ARB trigger criteria: New technology adoption, cross-team architectural changes, changes to shared infrastructure, decisions with significant security or cost implications, and any decision where the cost of being wrong is high.

Case Study — Amazon's API Mandate

In 2002, Amazon was struggling with a codebase so deeply interconnected that teams could not work independently. Jeff Bezos issued the "API Mandate": all teams must expose their data and functionality through service interfaces; teams must communicate only through these interfaces, with no shared databases and no direct linking; all interfaces must be designed to be externally exposable.

This was a governance decision, not just an architectural one. It changed how teams were evaluated and how data was accessed. The mandate created the foundation for what eventually became Amazon Web Services — APIs built for internal use turned out to be sellable externally because they were designed with strict interface contracts from the beginning.

Practice Exercise (30 minutes): Write a complete ADR for a real architectural decision from your current or most recent project. Write all six sections. Write at least two rejected alternatives with reasons. Then answer: "If someone reads this ADR in three years, what context might they be missing?"

Module 04 — Cloud Readiness

There is a critical distinction between an application that runs in the cloud and one that is cloud-native. Running in the cloud means you have moved your virtual machines from a data centre to AWS. Cloud-native means your application is designed to exploit cloud capabilities: elastic scaling, managed services, pay-per-use cost models, and geographic distribution.

The 6 Rs of Cloud Migration

```mermaid
flowchart TD
    START["Assess Workload"] --> Q1{"Still needed?"}
    Q1 -- "No" --> RETIRE["Retire — Decommission"]
    Q1 -- "Yes" --> Q2{"Regulatory or technical blocker?"}
    Q2 -- "Yes" --> RETAIN["Retain — Keep on-premise"]
    Q2 -- "No" --> Q3{"Better SaaS alternative?"}
    Q3 -- "Yes" --> REPURCHASE["Repurchase — Move to SaaS"]
    Q3 -- "No" --> Q4{"Time constraint?"}
    Q4 -- "Urgent (weeks)" --> REHOST["Rehost — Lift and Shift"]
    Q4 -- "Medium (months)" --> REPLATFORM["Replatform — Lift and Reshape"]
    Q4 -- "Flexible (quarters)" --> REFACTOR["Refactor — Re-architect cloud-native"]
```

| Strategy | Description | Effort | Cloud Benefit |
| --- | --- | --- | --- |
| Rehost (Lift & Shift) | Move as-is to cloud VMs, no code changes | Low | Minimal |
| Replatform (Lift & Reshape) | Small optimisations, e.g. move to managed RDS | Low–Medium | Moderate |
| Repurchase | Replace entirely with a SaaS product | Medium | High |
| Refactor / Re-architect | Redesign as cloud-native | High | Very High |
| Retire | Decommission — application no longer needed | Low | N/A |
| Retain | Keep on-premise — regulatory or risk reasons | None | None |

The 12-Factor App

Authored by Adam Wiggins (Heroku co-founder) and published in 2011. The four most architecturally significant factors:

| Factor | What It Means | Architectural Impact |
| --- | --- | --- |
| III. Config in environment | No hardcoded config. Use environment variables or a config service. | Enables deploying the same artefact to dev, staging, and prod |
| VI. Stateless processes | No persistent data in memory between requests. State lives in backing services. | Makes horizontal scaling possible — any instance can serve any request |
| IX. Disposability | Processes start fast and shut down gracefully. | Enables auto-scaling and zero-downtime deployments |
| XI. Logs as event streams | Applications write logs to stdout; infrastructure handles aggregation. | Application has no knowledge of log destination — deployable anywhere |
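Factor III is simple enough to sketch in a few lines of Python. The variable names (`APP_DB_URL`, `APP_LOG_LEVEL`) are illustrative; the point is that the same artefact behaves differently only through what the environment injects.

```python
import os

def load_config(env: dict = os.environ) -> dict:
    """Read configuration from the environment (Factor III sketch)."""
    try:
        db_url = env["APP_DB_URL"]  # required — fail fast at startup if absent
    except KeyError:
        raise RuntimeError("APP_DB_URL must be set in the environment")
    return {
        "db_url": db_url,
        "log_level": env.get("APP_LOG_LEVEL", "INFO"),  # optional, with a default
    }
```

Failing fast on missing required config at startup, rather than at first use, is what makes the same build safely promotable from dev to staging to prod.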

Cloud-Native Resilience Patterns

Circuit Breaker

```mermaid
stateDiagram-v2
    [*] --> Closed : Initial state
    Closed --> Open : Failure threshold exceeded
    Open --> HalfOpen : Timeout period elapsed
    HalfOpen --> Closed : Test request succeeds
    HalfOpen --> Open : Test request fails
    Closed : CLOSED — requests pass through normally
    Open : OPEN — requests fail immediately, fallback returned
    HalfOpen : HALF-OPEN — one test request allowed
```
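The state machine above translates almost directly into code. This is a minimal single-threaded Python sketch; the threshold and timeout defaults are illustrative, and production libraries add thread safety, per-endpoint breakers, and metrics.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"   # timeout elapsed: allow one test request
            else:
                return fallback()          # fail fast, no downstream call made
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"        # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"              # any success closes the circuit
        return result
```

The crucial behaviour is in the OPEN branch: the failing dependency receives no traffic at all, which is what stops a slow downstream service from exhausting the caller's threads and cascading the failure.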

Other Key Patterns

Case Study — Airbnb's Migration from Monolith

By 2017, Airbnb's Rails monolith had grown to millions of lines of code. Build times exceeded 45 minutes. Before migrating, engineers assessed the monolith against cloud-native principles and found critical violations: session state in server memory (violating Factor VI), configuration embedded in YAML files committed to version control (violating Factor III), startup time of 8 minutes (violating Factor IX), and logs written to local disk (violating Factor XI).

Airbnb addressed the 12-Factor violations first (sessions moved to Redis, config moved to Consul), then began extracting services at natural domain boundaries. The search service was extracted first because it had the clearest data boundary and the highest independent scaling need.

Practice Exercise (30 minutes): Choose a system you know. Apply the 6 Rs. Check the system against the four most important 12-Factor principles. Where does it fall short? Identify the single most important cloud-native pattern it is missing. Estimate the migration effort using T-shirt sizing.

Module 05 — Estimation

The goal of estimation is not accuracy. The goal is calibrated confidence — communicating what you know, what you do not know, and how confident you are, in terms a business stakeholder can act on.

T-Shirt Sizing

| Size | Approximate Effort | Characteristics | Examples |
| --- | --- | --- | --- |
| XS | 1–3 days | Well-understood, single developer, no dependencies | Adding a field to an API response |
| S | 1–2 weeks | Clear scope, single team, some implementation unknowns | New API endpoint; integration with a known SDK |
| M | 2–6 weeks | Multiple components, cross-team coordination, meaningful unknowns | New microservice extraction; auth system replacement |
| L | 2–4 months | Significant complexity, multiple teams, unclear scope areas | Cloud migration of an application; new event-driven pipeline |
| XL | 4+ months | High complexity, many unknowns — break into phases | Full microservices migration; new product platform |

Note: An XL estimate is a signal, not a commitment. The right response is: "How do we break this into deliverable phases where each phase provides business value?"

Three-Point Estimation (PERT)

Program Evaluation and Review Technique gives a statistically defensible expected value while explicitly modelling uncertainty.

```
Expected = (O + 4M + P) / 6
Std Dev  = (P - O) / 6
```

Worked Example — OAuth 2.0 Migration:

| Scenario | Weeks | Reasoning |
| --- | --- | --- |
| Optimistic (O) | 4 | Library works first time, no legacy edge cases, team has done this before |
| Most Likely (M) | 7 | 2 weeks discovery, 3 weeks implementation, 2 weeks testing and fixes |
| Pessimistic (P) | 14 | Legacy session handling deeply embedded, clients need coordinated migration, security review required |

```
Expected = (4 + 4×7 + 14) / 6 = 46 / 6 ≈ 7.7 weeks
Std Dev  = (14 - 4) / 6 ≈ 1.7 weeks
68% confidence interval: roughly 6.0 to 9.4 weeks
```
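The calculation is trivial to encode, which makes it easy to run across a whole backlog. A minimal helper, applied to the worked example above:

```python
def pert_estimate(optimistic: float, most_likely: float, pessimistic: float):
    """Return (expected, std_dev) for a three-point PERT estimate."""
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return expected, std_dev

# The OAuth migration example above:
expected, sd = pert_estimate(4, 7, 14)
# expected ≈ 7.67 weeks, sd ≈ 1.67 weeks
# Roughly 68% of outcomes fall within expected ± 1 sd, i.e. about 6.0–9.3 weeks.
```

The standard deviation is the part worth reporting to stakeholders: a wide (P − O) spread is a direct, quantified statement of how much you do not yet know.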

Cone of Uncertainty

The earlier in a project you estimate, the wider the range of error. This is not a failure of skill — it is a mathematical reality. Estimation error decreases as a project progresses and more information is known.

```mermaid
xychart-beta
    title "Estimation Error Decreases Over Project Lifecycle"
    x-axis ["Initial Concept", "Requirements Complete", "Design Complete", "Code Complete"]
    y-axis "Estimation Error (%)" 0 --> 400
    bar [350, 75, 30, 10]
```

Warning — The Commitment Trap: The most dangerous moment in estimation is when a stakeholder receives a wide early-phase estimate and asks you to "just give a number for planning purposes." The number you give becomes the plan. Caveats evaporate. "I cannot give you a number more accurate than X until we complete the discovery phase" is one of the most important sentences a Technical Architect can say.

Case Study — Healthcare.gov Launch Failure

Healthcare.gov, the US federal health insurance marketplace, launched in October 2013 and immediately failed. On its first day, only 6 people successfully enrolled despite 250,000 visitors. Load estimates were based on optimistic scenarios. The database was not load-tested until two weeks before launch. Integration testing of the 55 separate agencies' systems was not completed until the final weeks.

The most critical failure was in integration complexity estimation. Integration complexity does not scale linearly: two systems have one integration point, three systems have three, ten systems have 45 possible interaction points. This was not adequately accounted for.
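The quadratic growth described above is just the pairwise-combination formula, n(n − 1)/2, which is worth keeping to hand whenever an estimate involves integrating multiple systems:

```python
def integration_points(n_systems: int) -> int:
    """Possible pairwise integration points between n systems: n(n-1)/2."""
    return n_systems * (n_systems - 1) // 2

# 2 systems -> 1, 3 -> 3, 10 -> 45.
# At 55 systems there are 1,485 possible pairings — one reason integration
# effort never scales linearly with the number of systems involved.
```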

Practice Exercise (25 minutes): Estimate an OAuth 2.0 migration where 40 client applications consume the current LDAP-based system and the team has never implemented OAuth 2.0 before. Document your O/M/P scenarios, top 3 assumptions, top 3 risks, and what a 2-week discovery phase should investigate.

Module 06 — GenAI Architecture

Large Language Models are not deterministic systems. Send the same prompt twice and you may receive different responses. Ask them a factual question and they may answer confidently with fabricated information. Integrating LLMs into production systems requires patterns specifically designed to manage non-determinism, latency, cost, and trust.

Retrieval Augmented Generation (RAG)

RAG solves the fundamental problem that LLMs know nothing about your organisation's specific documents, policies, or internal knowledge. RAG retrieves relevant context from your own data store and provides it to the LLM as part of the prompt.

Query-Time Flow

```mermaid
flowchart TD
    UQ["User Query"] --> EM["Embedding Model — convert query to vector"]
    EM --> VS["Vector Database — semantic similarity search"]
    VS -- "Top-K relevant chunks" --> PC["Prompt Constructor\nSystem prompt + Context + Query"]
    PC --> LLM["LLM (GPT-4 / Claude / Gemini)"]
    LLM --> RV["Response Validator — check for hallucination, PII"]
    RV --> USER["Response to User"]
```

Ingestion Pipeline (Offline)

```mermaid
flowchart LR
    DOCS["Source Documents\nPDF, Word, Confluence"] --> LOAD["Document Loader\nExtract text"]
    LOAD --> CHUNK["Chunker\nSplit into overlapping pieces"]
    CHUNK --> EMBED["Embedding Model\nConvert to vectors"]
    EMBED --> INDEX["Vector Index\nStore with metadata"]
    INDEX --> FRESH["Freshness Manager\nHandle updates and re-indexing"]
```
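The chunking step above is the part most often got wrong. A minimal fixed-size chunker with overlap (the sizes are illustrative; real pipelines often chunk on sentence or heading boundaries instead):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks whose edges overlap, so a fact that
    falls on a chunk boundary still appears intact in at least one chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap is a deliberate trade-off: it duplicates some stored text and embedding cost in exchange for never splitting a relevant passage across two chunks that each score poorly on their own.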

Agents and Tool Use

Pure RAG is a read-only pattern. Agents extend this by allowing the LLM to call functions, query APIs, execute code, and take real-world actions. This dramatically increases capability and risk simultaneously.

```mermaid
sequenceDiagram
    participant U as User
    participant A as Agent (LLM)
    participant T as Tools / APIs
    U->>A: "Book a meeting with the marketing team for Tuesday 2pm"
    A->>A: Reason: Need to find marketing team members
    A->>T: get_team_members(team="marketing")
    T-->>A: ["alice@co.com", "bob@co.com", "carol@co.com"]
    A->>A: Reason: Check availability
    A->>T: check_calendar(attendees=[...], time="Tuesday 14:00")
    T-->>A: { available: true }
    A->>T: create_event(title="Marketing Meeting", ...)
    T-->>A: { event_id: "evt_123", status: "created" }
    A->>U: "Meeting booked with Alice, Bob, and Carol for Tuesday at 2pm."
```

Warning: Design every agentic system with explicit action allowlists, confirmation steps before irreversible actions, rate limiting per user and per action type, and comprehensive logging. A malicious user could inject instructions through a document the agent reads (prompt injection), overriding your system prompt.

Risks and Mitigations

| Risk | Description | Architectural Mitigation |
| --- | --- | --- |
| Hallucination | LLM generates plausible-sounding but factually incorrect information | Use RAG to ground responses in verified documents. Build evaluation harnesses with golden datasets. |
| Latency | LLM API calls take 1–30 seconds | Streaming responses, async processing, response caching, smaller models for latency-sensitive paths. |
| Cost | LLMs charge per token; complex prompts are expensive at scale | Token budget management, prompt caching, smaller models for classification tasks, monitor cost per request. |
| Data Privacy | User data sent to third-party LLM APIs may violate GDPR or HIPAA | PII detection and redaction before sending to external APIs. Audit logs of what data was sent. |
| Prompt Injection | Malicious users embed instructions that override system prompts | Input sanitisation, privilege separation, output validation, anomaly monitoring. |
| Model Version Drift | Providers update models without notice, changing behaviour | Pin to specific model versions. Maintain regression test suites. |
| Non-Determinism | Same prompt produces different outputs | Set temperature to 0 for deterministic tasks. Build probabilistic evaluation harnesses. |
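To make the data-privacy mitigation concrete, here is a deliberately naive regex-based redactor. The patterns catch only obvious email addresses and US-style SSNs and are illustrative, not exhaustive; production systems use dedicated PII-detection services with far broader coverage.

```python
import re

# Each pattern pairs a compiled regex with its replacement token.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),   # email addresses
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),       # US-style SSNs
]

def redact(text: str) -> str:
    """Replace matched PII with placeholder tokens before the text leaves
    your boundary for an external LLM API."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running redaction before the external API call, and logging what was redacted, gives you both the GDPR/HIPAA mitigation and the audit trail the table calls for.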

Case Study — GitHub Copilot

Code suggestions that arrive more than 200ms after the user pauses typing feel unresponsive. Copilot uses a purpose-trained model rather than a general-purpose model — a smaller, faster model achieves quality competitive with larger models on the specific task of code completion. The prompt is carefully engineered to include only the most relevant context (surrounding code, related files, open tabs), keeping token counts, and therefore latency and cost, bounded.

GitHub cannot test Copilot by checking if generated code is "correct" — correctness is domain-specific. Instead they evaluate on proxy metrics: acceptance rate (did the developer accept the suggestion?), persistence rate (is the accepted suggestion still in the code 30 seconds later?), and latency distribution (what percentage of suggestions arrive within 200ms?).

Practice Exercise (35 minutes): Design a RAG-based internal knowledge assistant for a 5,000-person company. Produce: a C4 Level 2 container diagram; the top 5 NFRs with SMART metrics; the top 3 risks with specific mitigations; an explanation of how you would handle document freshness; and a cost model.

Artifact Templates

Architecture Proposal Template

| Section | Content | Length |
| --- | --- | --- |
| 1. Executive Summary | Problem, proposed solution, key trade-offs | 1 paragraph |
| 2. Problem Statement | Business problem, who is affected, constraints | 1–2 paragraphs |
| 3. Assumptions & Constraints | All assumptions listed explicitly | Bulleted list |
| 4. Proposed Architecture | C4 Level 1 + Level 2 diagrams, one paragraph per major component | Diagrams + 1 page |
| 5. Key Decisions | 3–5 decisions in mini-ADR format | ½ page per decision |
| 6. NFR Coverage | Top 5 NFRs with SMART metrics and how the architecture addresses each | Table |
| 7. Risks & Mitigations | Top 3–5 risks with likelihood, impact (H/M/L), and specific mitigation | Table |
| 8. Open Questions | What would you need to know to finalise this design? | Bulleted list |

NFR Specification Template

| Attribute | Content |
| --- | --- |
| Category | Performance / Scalability / Availability / Security / Observability / etc. |
| Statement | The specific, measurable requirement |
| Measurement Method | How this will be verified — load test, security scan, uptime monitor |
| Acceptance Threshold | Pass/fail criteria with specific numbers |
| Priority | Must Have / Should Have / Nice to Have |
| Architectural Impact | Which architectural decisions this NFR drives |

Glossary

| Term | Definition | Module |
| --- | --- | --- |
| ADR | Architecture Decision Record — a document capturing a single architectural decision, its context, alternatives, and consequences | 03 |
| ARB | Architecture Review Board — a governance forum for reviewing significant architectural decisions | 03 |
| Availability | The proportion of time a system is operational, expressed as a percentage | 02 |
| Bounded Context | A logical boundary within which a domain model is internally consistent | 01 |
| C4 Model | A hierarchical diagramming approach with four levels: Context, Container, Component, Code | 01 |
| Circuit Breaker | A pattern that prevents cascade failures by stopping calls to a failing service after a threshold | 04 |
| Cohesion | The degree to which elements within a component belong together. High cohesion is desirable. | 01 |
| Coupling | The degree of dependency between components. Low coupling is desirable. | 01 |
| Embedding | A numerical vector representation of text capturing semantic meaning | 06 |
| Hallucination | An LLM generating plausible-sounding but factually incorrect information | 06 |
| NFR | Non-Functional Requirement — a requirement specifying how well a system performs its functions | 02 |
| PERT | Program Evaluation and Review Technique — three-point estimation using O, M, and P estimates | 05 |
| Prompt Injection | An attack where malicious input overrides an LLM system prompt | 06 |
| RAG | Retrieval Augmented Generation — grounds LLM responses in retrieved documents | 06 |
| RPO | Recovery Point Objective — the maximum acceptable data loss measured in time | 02 |
| RTO | Recovery Time Objective — the maximum acceptable duration of downtime after a failure | 02 |
| Strangler Fig Pattern | Migrating a legacy system by gradually extracting components over time | 01 |
| T-Shirt Sizing | Relative estimation using XS/S/M/L/XL categories | 05 |
| 12-Factor App | A methodology defining 12 characteristics of applications designed for cloud deployment, authored by Adam Wiggins in 2011 | 04 |
| Vector Database | A database optimised for storing and querying embedding vectors | 06 |
