Architecture Guide

This guide explains the internal architecture of GeoTessera and how the various components work together to provide efficient access to Tessera embeddings.

Overview

GeoTessera is designed around a simple but powerful architecture that optimizes for:

  • Efficient data access: Only download what you need

  • Projection preservation: Maintain native UTM projections for accuracy

  • Scalability: Handle large datasets with lazy loading

  • Flexibility: Support both analysis and GIS workflows

  • Reliability: Ensure data integrity with checksums

Core Architecture

The library follows a layered architecture:

User Interface Layer
├── CLI Commands (geotessera download, visualize, etc.)
└── Python API (GeoTessera class)
        ↓
Core Processing Layer
├── GeoTessera class (main interface)
├── Registry (Parquet-based data discovery)
└── Visualization (rendering and web maps)
        ↓
Data Access Layer
├── Anonymous S3 downloads (botocore)
├── Zarr v3 store (cloud-native streaming)
├── Rasterio (GeoTIFF I/O)
└── GeoPandas (geospatial operations)
        ↓
Storage Layer
├── Remote servers (https://s3.us-west-2.amazonaws.com/tessera-embeddings)
├── Zarr store (https://s3.us-west-2.amazonaws.com/tessera-embeddings/v1/zarr)
└── Local cache (~/.cache/geotessera/{v1,v1.1}/manifest.parquet)

Coordinate System and Grid

Understanding the Tessera Grid

The Tessera embeddings are organized on a 0.1-degree grid system:

Grid Properties:

  • Grid spacing: 0.1° latitude × 0.1° longitude

  • Tile naming: Named by center coordinates (e.g., grid_0.15_52.05)

  • Coverage: Each tile spans from (center - 0.05°) to (center + 0.05°)

  • Resolution: Approximately 11km × 11km at the equator

Coordinate Calculations:

# For a tile at center coordinates (lon, lat)
west = lon - 0.05
east = lon + 0.05
south = lat - 0.05
north = lat + 0.05

Grid Alignment:

Tile centers are aligned to 0.1-degree boundaries:

# Valid tile centers (examples)
valid_centers = [
    (0.05, 52.05),   # Northwest Europe
    (0.15, 52.05),   # Adjacent tile
    (-0.05, 51.95),  # Southwest tile
]

# Invalid centers (not on grid)
invalid_centers = [
    (0.07, 52.03),   # Off-grid
    (0.1, 52.1),     # Off by 0.05°
]

Resolution and Pixel Density

The number of pixels per tile varies with latitude due to the Earth’s curvature:

import math

def pixels_per_tile(latitude, resolution_meters=10):
    """Calculate approximate pixels per tile at given latitude."""
    # Earth circumference at equator (meters)
    earth_circumference = 40075000

    # Degrees per meter at equator
    degrees_per_meter = 360 / earth_circumference

    # Adjust for latitude (longitude only)
    lon_degrees_per_meter = degrees_per_meter / math.cos(math.radians(latitude))
    lat_degrees_per_meter = degrees_per_meter

    # Tile size in meters
    tile_width_meters = 0.1 / lon_degrees_per_meter
    tile_height_meters = 0.1 / lat_degrees_per_meter

    # Pixels in tile
    pixels_width = int(tile_width_meters / resolution_meters)
    pixels_height = int(tile_height_meters / resolution_meters)

    return pixels_width, pixels_height

# Examples
eq_pixels = pixels_per_tile(0)      # ~(1111, 1111) at equator
uk_pixels = pixels_per_tile(52)     # ~(1823, 1111) in UK
arctic_pixels = pixels_per_tile(80) # ~(6389, 1111) near poles

Data Format and Storage

Quantization System

Tessera embeddings are stored using a quantization system for efficiency:

Storage Format:

  1. Quantized embeddings (grid_X.XX_Y.YY.npy):

    • Data type: int8 (values -128 to 127)

    • Shape: (height, width, 128)

    • Storage efficient: ~1MB per tile vs ~64MB unquantized

  2. Scale factors (grid_X.XX_Y.YY_scales.npy):

    • Data type: float32

    • Shape: (height, width) or (height, width, 128)

    • Contains dequantization multipliers

Dequantization Process:

import numpy as np

# Load quantized data and scales
quantized = np.load("grid_0.15_52.05.npy")         # int8
scales = np.load("grid_0.15_52.05_scales.npy")     # float32

# Dequantize
if scales.ndim == 2:
    # Broadcast 2D scales to 3D
    scales = scales[..., np.newaxis]

embedding = quantized.astype(np.float32) * scales

# Result: (height, width, 128) float32 array

This process is handled automatically by GeoTessera.fetch_embedding(), which now returns the dequantized embedding along with CRS and transform information from the corresponding landmask tile.

Metadata and Projections

Landmask Files (grid_X.XX_Y.YY.tiff):

  • Provide native UTM projection information for each tile

  • Define precise geospatial transforms (no reprojection needed)

  • Preserve original coordinate system for maximum accuracy

  • Used for georeferencing when exporting to GeoTIFF

  • Contain binary land/water masks

Projection Selection:

Each tile uses an appropriate UTM zone based on its location:

def get_utm_zone(longitude):
    """Get UTM zone number for a longitude."""
    return int((longitude + 180) / 6) + 1

def get_utm_epsg(longitude, latitude):
    """Get EPSG code for UTM projection."""
    zone = get_utm_zone(longitude)

    if latitude >= 0:
        # Northern hemisphere
        return f"EPSG:{32600 + zone}"
    else:
        # Southern hemisphere
        return f"EPSG:{32700 + zone}"

# Example: London at (0.15, 52.05)
epsg = get_utm_epsg(0.15, 52.05)  # "EPSG:32631" (UTM Zone 31N)

Registry System

Parquet-Based Registry

The registry uses one Parquet manifest per dataset version for efficient data discovery and querying. The manifest is filtered by (version, variant) at load time, then queried by lat/lon/year:

Manifest Structure (s3://tessera-embeddings/{v1,v1.1}/manifest.parquet):

manifest.parquet (one per dataset version)
├── Columns:
│   ├── version          # Normalised dataset version ('1.0', '1.1')
│   ├── variant          # Dataset variant ('vultr', 'cambridge', ...)
│   ├── lon, lat         # Tile center coordinates
│   ├── year             # Data year
│   ├── grid_size        # Embedding NPY byte size
│   ├── scales_size      # Scales NPY byte size
│   ├── grid_path        # Full s3:// URI of the embedding
│   ├── scales_path      # Full s3:// URI of the scales
│   ├── grid_mtime       # Object mtime on S3
│   └── scales_mtime
└── Rows: One per (version, variant, year, lon, lat) tile

Integrity is not carried in the manifest — it’s verified end-to-end at download time against S3’s x-amz-checksum-crc64nvme response header.

Querying the Manifest:

import pandas as pd

df = pd.read_parquet("manifest.parquet")  # the v1.1 file
# Filter to (version, variant) you want — the manifest can carry both
df = df[(df['version'] == '1.1') & (df['variant'] == 'cambridge')]

# Query tiles in a region
bbox = (-0.2, 51.4, 0.1, 51.6)  # (min_lon, min_lat, max_lon, max_lat)
tiles = df[
    (df['lon'] >= bbox[0]) & (df['lon'] <= bbox[2]) &
    (df['lat'] >= bbox[1]) & (df['lat'] <= bbox[3]) &
    (df['year'] == 2024)
]

Manifest Loading Process:

  1. Download per-version manifest (only the one matching dataset_version)

  2. Filter to the requested (version, variant)

  3. Materialise Point geometry from lon/lat (R-tree spatial index)

  4. Set MultiIndex on (year, lon_i, lat_i) for O(1) lookups

  5. Cache the parquet at ~/.cache/geotessera/{v1,v1.1}/manifest.parquet with an ETag sidecar for conditional GETs

Manifest Sources

The manifest can be loaded from multiple sources:

1. Default Remote (recommended):

# Downloads and caches the v1 manifest automatically.
from geotessera import GeoTessera
gt = GeoTessera()

# Cached at: ~/.cache/geotessera/v1/manifest.parquet

# Pick a different (version, variant) — see :ref:`dataset-versions`
gt = GeoTessera(dataset_version="v1.1", dataset_variant="cambridge")

2. Local File:

gt = GeoTessera(registry_path="/path/to/manifest.parquet")

3. Local Directory:

# Looks for manifest.parquet in the directory (also accepts the legacy
# registry.parquet name for backward compat).
gt = GeoTessera(registry_dir="/path/to/manifest-dir")

4. Custom URL:

gt = GeoTessera(registry_url="https://example.com/manifest.parquet")

5. CLI Option:

geotessera download --cache-dir /custom/cache ...

Data Access Layer

S3 Downloads

GeoTessera streams tiles directly from the public S3 bucket using anonymous (unsigned) botocore requests:

Features:

  • Per-output-dir mirroring: Tiles land in the user-supplied --output directory and persist there for re-use across runs

  • Integrity checking: End-to-end CRC64NVMe verified against S3’s x-amz-checksum-crc64nvme response header during the body stream

  • Conditional caching: Per-version manifests use If-None-Match / ETag sidecars so unchanged manifests yield a 304 with zero body

  • Progress callbacks: Real-time download feedback with speed and size info

  • Resumable: Existing files in the output dir are skipped on rerun

Cache Structure:

~/.cache/geotessera/
├── v1/
│   ├── manifest.parquet           # Per-version tile manifest
│   ├── manifest.parquet.etag      # HTTP ETag for conditional GETs
│   ├── landmasks.parquet
│   └── landmasks.parquet.etag
└── v1.1/
    ├── manifest.parquet
    ├── manifest.parquet.etag
    ├── landmasks.parquet
    └── landmasks.parquet.etag

# Embedding/landmark tile data lives in the user's --output dir,
# not in this cache. The cache holds only per-version manifests.

Download Process:

import numpy as np
from geotessera import dequantize_embedding

def fetch_embedding(lon, lat, year):
    # 1. Fetch the quantized embedding and scales tiles. ``fetch`` returns
    #    a path under embeddings_dir, downloading from S3 if not present.
    embedding_file = registry.fetch(year=year, lon=lon, lat=lat, is_scales=False)
    scales_file = registry.fetch(year=year, lon=lon, lat=lat, is_scales=True)

    # 2. Load and dequantize
    quantized = np.load(embedding_file)
    scales = np.load(scales_file)
    embedding = dequantize_embedding(quantized, scales)

    # 3. Get CRS from the landmask tile
    crs, transform = get_utm_projection_from_landmask(lon, lat)

    return embedding, crs, transform

Persistent Tile Storage

Why persist tiles?

  • Tiles land in the user-supplied --output (embeddings_dir) and are re-used across runs rather than re-downloaded

  • Existing files are skipped on rerun, making interrupted downloads resumable

  • Only the small per-version manifests live in ~/.cache/geotessera; the bulk embedding data stays under the output directory the user controls

Cache Configuration:

from geotessera import GeoTessera

# Control where registry is cached
gt = GeoTessera(cache_dir="/custom/cache")

# Default cache locations:
# - Linux/macOS: ~/.cache/geotessera/
# - Windows: %LOCALAPPDATA%/geotessera/

GeoTIFF Export Process

When exporting to GeoTIFF, additional processing occurs:

Export Workflow:

  1. Fetch embedding data (quantized + scales)

  2. Fetch landmask tile for projection information

  3. Extract native UTM projection and transform from landmask

  4. Apply dequantization to embedding data

  5. Preserve original coordinate system (no reprojection)

  6. Select bands (if specified)

  7. Write GeoTIFF with native UTM CRS and accurate transform

  8. Apply compression (LZW, DEFLATE, etc.)

Projection Inheritance:

import rasterio

def export_geotiff(embedding, landmask_path, output_path, bands=None):
    # Read projection from landmask
    with rasterio.open(landmask_path) as landmask:
        crs = landmask.crs
        transform = landmask.transform

    # Select bands
    if bands:
        embedding = embedding[:, :, bands]

    # Write GeoTIFF
    with rasterio.open(output_path, 'w',
                      driver='GTiff',
                      height=embedding.shape[0],
                      width=embedding.shape[1],
                      count=embedding.shape[2],
                      dtype=embedding.dtype,
                      crs=crs,
                      transform=transform,
                      compress='lzw') as dst:

        for i in range(embedding.shape[2]):
            dst.write(embedding[:, :, i], i + 1)

Performance Considerations

Memory Management

Large Region Handling:

When processing large regions, GeoTessera uses several strategies:

  • Tile-by-tile processing: Process one tile at a time to limit memory usage

  • Band selection: Only load required bands to reduce memory footprint

  • Generator patterns: Use generators for large tile collections

  • Progress callbacks: Provide feedback for long operations

Example Memory-Efficient Processing:

def process_large_region(bbox, year, bands=None):
    """Process a large region without loading all tiles into memory."""
    gt = GeoTessera()

    # Step 1: Get tile list (metadata only, no data loaded)
    tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=year)

    # Step 2: Process tiles one at a time using generator
    for year, tile_lon, tile_lat, embedding, crs, transform in gt.fetch_embeddings(tiles_to_fetch):
        # Apply band selection early to reduce memory
        if bands:
            embedding = embedding[:, :, bands]

        # Process this tile
        result = process_single_tile(embedding)

        # Save or accumulate results
        save_tile_result(result, tile_lat, tile_lon)

        # Free memory
        del embedding

Network Optimization

Sequential Processing:

The fetch_embeddings() generator processes tiles sequentially, which is optimal for most use cases:

# Sequential processing (recommended for most cases)
gt = GeoTessera()
tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=2024)

# Returns generator - tiles are fetched one at a time
for year, tile_lon, tile_lat, embedding, crs, transform in gt.fetch_embeddings(tiles_to_fetch):
    process_tile(embedding)  # Memory efficient

Point Sampling:

For sampling at specific locations, use the optimized point sampling method:

# Efficient point sampling with automatic tile download
points = [(0.15, 52.05), (0.25, 52.15), (-0.05, 51.55)]
embeddings = gt.sample_embeddings_at_points(points, year=2024)

# With metadata about which tile each point came from
embeddings, metadata = gt.sample_embeddings_at_points(
    points, year=2024, include_metadata=True
)

Cache Efficiency:

  • Pre-warming: Download commonly used tiles in advance

  • Batch processing: Group requests by geographic region

  • Size limits: Respect server rate limits

Zarr Store (Cloud-Native Access)

The GeoTesseraZarr class provides cloud-native access to embeddings without downloading files. It implements the geoemb: convention for geospatial embedding data stored in Zarr v3 format.

Architecture:

Data is organized by UTM zone, with each zone stored as a separate Zarr group. The store automatically routes geographic queries to the correct zone:

zarr store
├── Root attributes (geoemb:model, geoemb:build_version)
├── utm30/           # UTM Zone 30
│   ├── time[:]      # Year coordinate array
│   ├── embedding    # (time, y, x, band) float32
│   └── ...
├── utm31/           # UTM Zone 31
│   └── ...
└── ...

Access patterns:

  • Point sampling: sample_points() / sample_at() for extracting embeddings at specific coordinates across zones

  • Region reading: read_region() for loading rectangular areas as mosaics with CRS and transform metadata

  • Zone access: open_zone() returns an xarray Dataset with a .tessera accessor for direct manipulation

Datasets are cached per zone for the lifetime of the GeoTesseraZarr instance.

Future Extensions

The architecture supports future enhancements:

  • Temporal queries: Multi-year analysis

  • Cloud optimization: Direct cloud storage access

  • ML integration: TensorFlow/PyTorch data loaders

  • Real-time updates: Live data ingestion

  • Distributed processing: Dask/Ray integration