Architecture Guide
=================

This guide explains the internal architecture of GeoTessera and how the various components work together to provide efficient access to Tessera embeddings.

Overview
--------

GeoTessera is designed around a simple but powerful architecture that optimizes for:

- **Efficient data access**: Only download what you need
- **Projection preservation**: Maintain native UTM projections for accuracy
- **Scalability**: Handle large datasets with lazy loading
- **Flexibility**: Support both analysis and GIS workflows
- **Reliability**: Ensure data integrity with checksums

Core Architecture
-----------------

The library follows a layered architecture:

.. code-block::

    User Interface Layer
    ├── CLI Commands (geotessera download, visualize, etc.)
    └── Python API (GeoTessera class)
            ↓
    Core Processing Layer
    ├── GeoTessera class (main interface)
    ├── Registry (Parquet-based data discovery)
    └── Visualization (rendering and web maps)
            ↓
    Data Access Layer
    ├── Anonymous S3 downloads (botocore)
    ├── Zarr v3 store (cloud-native streaming)
    ├── Rasterio (GeoTIFF I/O)
    └── GeoPandas (geospatial operations)
            ↓
    Storage Layer
    ├── Remote servers (https://s3.us-west-2.amazonaws.com/tessera-embeddings)
    ├── Zarr store (https://s3.us-west-2.amazonaws.com/tessera-embeddings/v1/zarr)
    └── Local cache (~/.cache/geotessera/{v1,v1.1}/manifest.parquet)

Coordinate System and Grid
--------------------------

Understanding the Tessera Grid
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Tessera embeddings are organized on a **0.1-degree grid system**:

**Grid Properties**:

- **Grid spacing**: 0.1° latitude × 0.1° longitude
- **Tile naming**: Named by center coordinates (e.g., ``grid_0.15_52.05``)
- **Coverage**: Each tile spans from (center - 0.05°) to (center + 0.05°)
- **Resolution**: Approximately 11km × 11km at the equator

**Coordinate Calculations**::

    # For a tile at center coordinates (lon, lat)
    west = lon - 0.05
    east = lon + 0.05  
    south = lat - 0.05
    north = lat + 0.05

**Grid Alignment**:

Tile centers are aligned to 0.1-degree boundaries::

    # Valid tile centers (examples)
    valid_centers = [
        (0.05, 52.05),   # Northwest Europe
        (0.15, 52.05),   # Adjacent tile
        (-0.05, 51.95),  # Southwest tile
    ]
    
    # Invalid centers (not on grid)
    invalid_centers = [
        (0.07, 52.03),   # Off-grid
        (0.1, 52.1),     # Off by 0.05°
    ]

Resolution and Pixel Density
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The number of pixels per tile varies with latitude due to the Earth's curvature:

.. code-block:: python

    import math
    
    def pixels_per_tile(latitude, resolution_meters=10):
        """Calculate approximate pixels per tile at given latitude."""
        # Earth circumference at equator (meters)
        earth_circumference = 40075000
        
        # Degrees per meter at equator
        degrees_per_meter = 360 / earth_circumference
        
        # Adjust for latitude (longitude only)
        lon_degrees_per_meter = degrees_per_meter / math.cos(math.radians(latitude))
        lat_degrees_per_meter = degrees_per_meter
        
        # Tile size in meters
        tile_width_meters = 0.1 / lon_degrees_per_meter
        tile_height_meters = 0.1 / lat_degrees_per_meter
        
        # Pixels in tile
        pixels_width = int(tile_width_meters / resolution_meters)
        pixels_height = int(tile_height_meters / resolution_meters)
        
        return pixels_width, pixels_height
    
    # Examples
    eq_pixels = pixels_per_tile(0)      # ~(1111, 1111) at equator
    uk_pixels = pixels_per_tile(52)     # ~(1823, 1111) in UK
    arctic_pixels = pixels_per_tile(80) # ~(6389, 1111) near poles

Data Format and Storage
-----------------------

Quantization System
~~~~~~~~~~~~~~~~~~~

Tessera embeddings are stored using a quantization system for efficiency:

**Storage Format**:

1. **Quantized embeddings** (``grid_X.XX_Y.YY.npy``):
   
   - Data type: ``int8`` (values -128 to 127)
   - Shape: ``(height, width, 128)``
   - Storage efficient: ~1MB per tile vs ~64MB unquantized

2. **Scale factors** (``grid_X.XX_Y.YY_scales.npy``):
   
   - Data type: ``float32``
   - Shape: ``(height, width)`` or ``(height, width, 128)``
   - Contains dequantization multipliers

**Dequantization Process**::

    import numpy as np
    
    # Load quantized data and scales
    quantized = np.load("grid_0.15_52.05.npy")         # int8
    scales = np.load("grid_0.15_52.05_scales.npy")     # float32
    
    # Dequantize
    if scales.ndim == 2:
        # Broadcast 2D scales to 3D
        scales = scales[..., np.newaxis]
    
    embedding = quantized.astype(np.float32) * scales
    
    # Result: (height, width, 128) float32 array

This process is handled automatically by ``GeoTessera.fetch_embedding()``, which now returns the dequantized embedding along with CRS and transform information from the corresponding landmask tile.

Metadata and Projections
~~~~~~~~~~~~~~~~~~~~~~~~

**Landmask Files** (``grid_X.XX_Y.YY.tiff``):

- Provide native UTM projection information for each tile
- Define precise geospatial transforms (no reprojection needed)
- Preserve original coordinate system for maximum accuracy
- Used for georeferencing when exporting to GeoTIFF
- Contain binary land/water masks

**Projection Selection**:

Each tile uses an appropriate UTM zone based on its location::

    def get_utm_zone(longitude):
        """Get UTM zone number for a longitude."""
        return int((longitude + 180) / 6) + 1
    
    def get_utm_epsg(longitude, latitude):
        """Get EPSG code for UTM projection."""
        zone = get_utm_zone(longitude)
        
        if latitude >= 0:
            # Northern hemisphere
            return f"EPSG:{32600 + zone}"
        else:
            # Southern hemisphere  
            return f"EPSG:{32700 + zone}"
    
    # Example: London at (0.15, 52.05)
    epsg = get_utm_epsg(0.15, 52.05)  # "EPSG:32631" (UTM Zone 31N)

Registry System
---------------

Parquet-Based Registry
~~~~~~~~~~~~~~~~~~~~~~

The registry uses one **Parquet manifest per dataset version** for efficient
data discovery and querying. The manifest is filtered by ``(version,
variant)`` at load time, then queried by lat/lon/year:

**Manifest Structure** (``s3://tessera-embeddings/{v1,v1.1}/manifest.parquet``):

.. code-block::

    manifest.parquet (one per dataset version)
    ├── Columns:
    │   ├── version          # Normalised dataset version ('1.0', '1.1')
    │   ├── variant          # Dataset variant ('vultr', 'cambridge', ...)
    │   ├── lon, lat         # Tile center coordinates
    │   ├── year             # Data year
    │   ├── grid_size        # Embedding NPY byte size
    │   ├── scales_size      # Scales NPY byte size
    │   ├── grid_path        # Full s3:// URI of the embedding
    │   ├── scales_path      # Full s3:// URI of the scales
    │   ├── grid_mtime       # Object mtime on S3
    │   └── scales_mtime
    └── Rows: One per (version, variant, year, lon, lat) tile

Integrity is **not** carried in the manifest — it's verified end-to-end at
download time against S3's ``x-amz-checksum-crc64nvme`` response header.

**Querying the Manifest**::

    import pandas as pd

    df = pd.read_parquet("manifest.parquet")  # the v1.1 file
    # Filter to (version, variant) you want — the manifest can carry both
    df = df[(df['version'] == '1.1') & (df['variant'] == 'cambridge')]

    # Query tiles in a region
    bbox = (-0.2, 51.4, 0.1, 51.6)  # (min_lon, min_lat, max_lon, max_lat)
    tiles = df[
        (df['lon'] >= bbox[0]) & (df['lon'] <= bbox[2]) &
        (df['lat'] >= bbox[1]) & (df['lat'] <= bbox[3]) &
        (df['year'] == 2024)
    ]

**Manifest Loading Process**:

1. **Download per-version manifest** (only the one matching ``dataset_version``)
2. **Filter** to the requested ``(version, variant)``
3. **Materialise Point geometry** from lon/lat (R-tree spatial index)
4. **Set MultiIndex** on ``(year, lon_i, lat_i)`` for O(1) lookups
5. **Cache** the parquet at ``~/.cache/geotessera/{v1,v1.1}/manifest.parquet``
   with an ETag sidecar for conditional GETs

Manifest Sources
~~~~~~~~~~~~~~~~

The manifest can be loaded from multiple sources:

**1. Default Remote** (recommended)::

    # Downloads and caches the v1 manifest automatically.
    from geotessera import GeoTessera
    gt = GeoTessera()

    # Cached at: ~/.cache/geotessera/v1/manifest.parquet

    # Pick a different (version, variant) — see :ref:`dataset-versions`
    gt = GeoTessera(dataset_version="v1.1", dataset_variant="cambridge")

**2. Local File**::

    gt = GeoTessera(registry_path="/path/to/manifest.parquet")

**3. Local Directory**::

    # Looks for manifest.parquet in the directory (also accepts the legacy
    # registry.parquet name for backward compat).
    gt = GeoTessera(registry_dir="/path/to/manifest-dir")

**4. Custom URL**::

    gt = GeoTessera(registry_url="https://example.com/manifest.parquet")

**5. CLI Option**::

    geotessera download --cache-dir /custom/cache ...

Data Access Layer
-----------------

S3 Downloads
~~~~~~~~~~~~

GeoTessera streams tiles directly from the public S3 bucket using anonymous
(unsigned) ``botocore`` requests:

**Features**:

- **Per-output-dir mirroring**: Tiles land in the user-supplied
  ``--output`` directory and persist there for re-use across runs
- **Integrity checking**: End-to-end CRC64NVMe verified against S3's
  ``x-amz-checksum-crc64nvme`` response header during the body stream
- **Conditional caching**: Per-version manifests use ``If-None-Match`` /
  ``ETag`` sidecars so unchanged manifests yield a 304 with zero body
- **Progress callbacks**: Real-time download feedback with speed and size info
- **Resumable**: Existing files in the output dir are skipped on rerun

**Cache Structure**::

    ~/.cache/geotessera/
    ├── v1/
    │   ├── manifest.parquet           # Per-version tile manifest
    │   ├── manifest.parquet.etag      # HTTP ETag for conditional GETs
    │   ├── landmasks.parquet
    │   └── landmasks.parquet.etag
    └── v1.1/
        ├── manifest.parquet
        ├── manifest.parquet.etag
        ├── landmasks.parquet
        └── landmasks.parquet.etag

    # Embedding/landmark tile data lives in the user's --output dir,
    # not in this cache. The cache holds only per-version manifests.

**Download Process**::

    import numpy as np
    from geotessera import dequantize_embedding

    def fetch_embedding(lon, lat, year):
        # 1. Fetch the quantized embedding and scales tiles. ``fetch`` returns
        #    a path under embeddings_dir, downloading from S3 if not present.
        embedding_file = registry.fetch(year=year, lon=lon, lat=lat, is_scales=False)
        scales_file = registry.fetch(year=year, lon=lon, lat=lat, is_scales=True)

        # 2. Load and dequantize
        quantized = np.load(embedding_file)
        scales = np.load(scales_file)
        embedding = dequantize_embedding(quantized, scales)

        # 3. Get CRS from the landmask tile
        crs, transform = get_utm_projection_from_landmask(lon, lat)

        return embedding, crs, transform

Persistent Tile Storage
~~~~~~~~~~~~~~~~~~~~~~~~

**Why persist tiles?**

- Tiles land in the user-supplied ``--output`` (``embeddings_dir``) and are
  re-used across runs rather than re-downloaded
- Existing files are skipped on rerun, making interrupted downloads resumable
- Only the small per-version manifests live in ``~/.cache/geotessera``;
  the bulk embedding data stays under the output directory the user controls

**Cache Configuration**::

    from geotessera import GeoTessera

    # Control where registry is cached
    gt = GeoTessera(cache_dir="/custom/cache")

    # Default cache locations:
    # - Linux/macOS: ~/.cache/geotessera/
    # - Windows: %LOCALAPPDATA%/geotessera/

GeoTIFF Export Process
~~~~~~~~~~~~~~~~~~~~~~

When exporting to GeoTIFF, additional processing occurs:

**Export Workflow**:

1. **Fetch embedding data** (quantized + scales)
2. **Fetch landmask tile** for projection information  
3. **Extract native UTM projection** and transform from landmask
4. **Apply dequantization** to embedding data
5. **Preserve original coordinate system** (no reprojection)
6. **Select bands** (if specified)
7. **Write GeoTIFF** with native UTM CRS and accurate transform
8. **Apply compression** (LZW, DEFLATE, etc.)

**Projection Inheritance**::

    import rasterio
    
    def export_geotiff(embedding, landmask_path, output_path, bands=None):
        # Read projection from landmask
        with rasterio.open(landmask_path) as landmask:
            crs = landmask.crs
            transform = landmask.transform
            
        # Select bands
        if bands:
            embedding = embedding[:, :, bands]
            
        # Write GeoTIFF
        with rasterio.open(output_path, 'w',
                          driver='GTiff',
                          height=embedding.shape[0],
                          width=embedding.shape[1], 
                          count=embedding.shape[2],
                          dtype=embedding.dtype,
                          crs=crs,
                          transform=transform,
                          compress='lzw') as dst:
            
            for i in range(embedding.shape[2]):
                dst.write(embedding[:, :, i], i + 1)

Performance Considerations
--------------------------

Memory Management
~~~~~~~~~~~~~~~~~

**Large Region Handling**:

When processing large regions, GeoTessera uses several strategies:

- **Tile-by-tile processing**: Process one tile at a time to limit memory usage
- **Band selection**: Only load required bands to reduce memory footprint  
- **Generator patterns**: Use generators for large tile collections
- **Progress callbacks**: Provide feedback for long operations

**Example Memory-Efficient Processing**::

    def process_large_region(bbox, year, bands=None):
        """Process a large region without loading all tiles into memory."""
        gt = GeoTessera()

        # Step 1: Get tile list (metadata only, no data loaded)
        tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=year)

        # Step 2: Process tiles one at a time using generator
        for year, tile_lon, tile_lat, embedding, crs, transform in gt.fetch_embeddings(tiles_to_fetch):
            # Apply band selection early to reduce memory
            if bands:
                embedding = embedding[:, :, bands]

            # Process this tile
            result = process_single_tile(embedding)

            # Save or accumulate results
            save_tile_result(result, tile_lat, tile_lon)

            # Free memory
            del embedding

Network Optimization
~~~~~~~~~~~~~~~~~~~~

**Sequential Processing**:

The fetch_embeddings() generator processes tiles sequentially, which is optimal for most use cases::

    # Sequential processing (recommended for most cases)
    gt = GeoTessera()
    tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=2024)

    # Returns generator - tiles are fetched one at a time
    for year, tile_lon, tile_lat, embedding, crs, transform in gt.fetch_embeddings(tiles_to_fetch):
        process_tile(embedding)  # Memory efficient

**Point Sampling**:

For sampling at specific locations, use the optimized point sampling method::

    # Efficient point sampling with automatic tile download
    points = [(0.15, 52.05), (0.25, 52.15), (-0.05, 51.55)]
    embeddings = gt.sample_embeddings_at_points(points, year=2024)

    # With metadata about which tile each point came from
    embeddings, metadata = gt.sample_embeddings_at_points(
        points, year=2024, include_metadata=True
    )

**Cache Efficiency**:

- **Pre-warming**: Download commonly used tiles in advance
- **Batch processing**: Group requests by geographic region
- **Size limits**: Respect server rate limits

Zarr Store (Cloud-Native Access)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``GeoTesseraZarr`` class provides cloud-native access to embeddings
without downloading files. It implements the ``geoemb:`` convention for
geospatial embedding data stored in Zarr v3 format.

**Architecture**:

Data is organized by UTM zone, with each zone stored as a separate Zarr
group. The store automatically routes geographic queries to the correct zone::

    zarr store
    ├── Root attributes (geoemb:model, geoemb:build_version)
    ├── utm30/           # UTM Zone 30
    │   ├── time[:]      # Year coordinate array
    │   ├── embedding    # (time, y, x, band) float32
    │   └── ...
    ├── utm31/           # UTM Zone 31
    │   └── ...
    └── ...

**Access patterns**:

- **Point sampling**: ``sample_points()`` / ``sample_at()`` for extracting
  embeddings at specific coordinates across zones
- **Region reading**: ``read_region()`` for loading rectangular areas as
  mosaics with CRS and transform metadata
- **Zone access**: ``open_zone()`` returns an xarray Dataset with a
  ``.tessera`` accessor for direct manipulation

Datasets are cached per zone for the lifetime of the ``GeoTesseraZarr``
instance.

Future Extensions
~~~~~~~~~~~~~~~~~

The architecture supports future enhancements:

- **Temporal queries**: Multi-year analysis
- **Cloud optimization**: Direct cloud storage access
- **ML integration**: TensorFlow/PyTorch data loaders
- **Real-time updates**: Live data ingestion
- **Distributed processing**: Dask/Ray integration