Architecture Guide ================= This guide explains the internal architecture of GeoTessera and how the various components work together to provide efficient access to Tessera embeddings. Overview -------- GeoTessera is designed around a simple but powerful architecture that optimizes for: - **Efficient data access**: Only download what you need - **Projection preservation**: Maintain native UTM projections for accuracy - **Scalability**: Handle large datasets with lazy loading - **Flexibility**: Support both analysis and GIS workflows - **Reliability**: Ensure data integrity with checksums Core Architecture ----------------- The library follows a layered architecture: .. code-block:: User Interface Layer ├── CLI Commands (geotessera download, visualize, etc.) └── Python API (GeoTessera class) ↓ Core Processing Layer ├── GeoTessera class (main interface) ├── Registry (Parquet-based data discovery) └── Visualization (rendering and web maps) ↓ Data Access Layer ├── Anonymous S3 downloads (botocore) ├── Zarr v3 store (cloud-native streaming) ├── Rasterio (GeoTIFF I/O) └── GeoPandas (geospatial operations) ↓ Storage Layer ├── Remote servers (https://s3.us-west-2.amazonaws.com/tessera-embeddings) ├── Zarr store (https://s3.us-west-2.amazonaws.com/tessera-embeddings/v1/zarr) └── Local cache (~/.cache/geotessera/{v1,v1.1}/manifest.parquet) Coordinate System and Grid -------------------------- Understanding the Tessera Grid ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The Tessera embeddings are organized on a **0.1-degree grid system**: **Grid Properties**: - **Grid spacing**: 0.1° latitude × 0.1° longitude - **Tile naming**: Named by center coordinates (e.g., ``grid_0.15_52.05``) - **Coverage**: Each tile spans from (center - 0.05°) to (center + 0.05°) - **Resolution**: Approximately 11km × 11km at the equator **Coordinate Calculations**:: # For a tile at center coordinates (lon, lat) west = lon - 0.05 east = lon + 0.05 south = lat - 0.05 north = lat + 0.05 **Grid Alignment**: Tile centers are aligned to 0.1-degree boundaries:: # Valid tile centers (examples) valid_centers = [ (0.05, 52.05), # Northwest Europe (0.15, 52.05), # Adjacent tile (-0.05, 51.95), # Southwest tile ] # Invalid centers (not on grid) invalid_centers = [ (0.07, 52.03), # Off-grid (0.1, 52.1), # Off by 0.05° ] Resolution and Pixel Density ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The number of pixels per tile varies with latitude due to the Earth's curvature: .. code-block:: python import math def pixels_per_tile(latitude, resolution_meters=10): """Calculate approximate pixels per tile at given latitude.""" # Earth circumference at equator (meters) earth_circumference = 40075000 # Degrees per meter at equator degrees_per_meter = 360 / earth_circumference # Adjust for latitude (longitude only) lon_degrees_per_meter = degrees_per_meter / math.cos(math.radians(latitude)) lat_degrees_per_meter = degrees_per_meter # Tile size in meters tile_width_meters = 0.1 / lon_degrees_per_meter tile_height_meters = 0.1 / lat_degrees_per_meter # Pixels in tile pixels_width = int(tile_width_meters / resolution_meters) pixels_height = int(tile_height_meters / resolution_meters) return pixels_width, pixels_height # Examples eq_pixels = pixels_per_tile(0) # ~(1111, 1111) at equator uk_pixels = pixels_per_tile(52) # ~(1823, 1111) in UK arctic_pixels = pixels_per_tile(80) # ~(6389, 1111) near poles Data Format and Storage ----------------------- Quantization System ~~~~~~~~~~~~~~~~~~~ Tessera embeddings are stored using a quantization system for efficiency: **Storage Format**: 1. **Quantized embeddings** (``grid_X.XX_Y.YY.npy``): - Data type: ``int8`` (values -128 to 127) - Shape: ``(height, width, 128)`` - Storage efficient: ~1MB per tile vs ~64MB unquantized 2. **Scale factors** (``grid_X.XX_Y.YY_scales.npy``): - Data type: ``float32`` - Shape: ``(height, width)`` or ``(height, width, 128)`` - Contains dequantization multipliers **Dequantization Process**:: import numpy as np # Load quantized data and scales quantized = np.load("grid_0.15_52.05.npy") # int8 scales = np.load("grid_0.15_52.05_scales.npy") # float32 # Dequantize if scales.ndim == 2: # Broadcast 2D scales to 3D scales = scales[..., np.newaxis] embedding = quantized.astype(np.float32) * scales # Result: (height, width, 128) float32 array This process is handled automatically by ``GeoTessera.fetch_embedding()``, which now returns the dequantized embedding along with CRS and transform information from the corresponding landmask tile. Metadata and Projections ~~~~~~~~~~~~~~~~~~~~~~~~ **Landmask Files** (``grid_X.XX_Y.YY.tiff``): - Provide native UTM projection information for each tile - Define precise geospatial transforms (no reprojection needed) - Preserve original coordinate system for maximum accuracy - Used for georeferencing when exporting to GeoTIFF - Contain binary land/water masks **Projection Selection**: Each tile uses an appropriate UTM zone based on its location:: def get_utm_zone(longitude): """Get UTM zone number for a longitude.""" return int((longitude + 180) / 6) + 1 def get_utm_epsg(longitude, latitude): """Get EPSG code for UTM projection.""" zone = get_utm_zone(longitude) if latitude >= 0: # Northern hemisphere return f"EPSG:{32600 + zone}" else: # Southern hemisphere return f"EPSG:{32700 + zone}" # Example: London at (0.15, 52.05) epsg = get_utm_epsg(0.15, 52.05) # "EPSG:32631" (UTM Zone 31N) Registry System --------------- Parquet-Based Registry ~~~~~~~~~~~~~~~~~~~~~~ The registry uses one **Parquet manifest per dataset version** for efficient data discovery and querying. The manifest is filtered by ``(version, variant)`` at load time, then queried by lat/lon/year: **Manifest Structure** (``s3://tessera-embeddings/{v1,v1.1}/manifest.parquet``): .. code-block:: manifest.parquet (one per dataset version) ├── Columns: │ ├── version # Normalised dataset version ('1.0', '1.1') │ ├── variant # Dataset variant ('vultr', 'cambridge', ...) │ ├── lon, lat # Tile center coordinates │ ├── year # Data year │ ├── grid_size # Embedding NPY byte size │ ├── scales_size # Scales NPY byte size │ ├── grid_path # Full s3:// URI of the embedding │ ├── scales_path # Full s3:// URI of the scales │ ├── grid_mtime # Object mtime on S3 │ └── scales_mtime └── Rows: One per (version, variant, year, lon, lat) tile Integrity is **not** carried in the manifest — it's verified end-to-end at download time against S3's ``x-amz-checksum-crc64nvme`` response header. **Querying the Manifest**:: import pandas as pd df = pd.read_parquet("manifest.parquet") # the v1.1 file # Filter to (version, variant) you want — the manifest can carry both df = df[(df['version'] == '1.1') & (df['variant'] == 'cambridge')] # Query tiles in a region bbox = (-0.2, 51.4, 0.1, 51.6) # (min_lon, min_lat, max_lon, max_lat) tiles = df[ (df['lon'] >= bbox[0]) & (df['lon'] <= bbox[2]) & (df['lat'] >= bbox[1]) & (df['lat'] <= bbox[3]) & (df['year'] == 2024) ] **Manifest Loading Process**: 1. **Download per-version manifest** (only the one matching ``dataset_version``) 2. **Filter** to the requested ``(version, variant)`` 3. **Materialise Point geometry** from lon/lat (R-tree spatial index) 4. **Set MultiIndex** on ``(year, lon_i, lat_i)`` for O(1) lookups 5. **Cache** the parquet at ``~/.cache/geotessera/{v1,v1.1}/manifest.parquet`` with an ETag sidecar for conditional GETs Manifest Sources ~~~~~~~~~~~~~~~~ The manifest can be loaded from multiple sources: **1. Default Remote** (recommended):: # Downloads and caches the v1 manifest automatically. from geotessera import GeoTessera gt = GeoTessera() # Cached at: ~/.cache/geotessera/v1/manifest.parquet # Pick a different (version, variant) — see :ref:`dataset-versions` gt = GeoTessera(dataset_version="v1.1", dataset_variant="cambridge") **2. Local File**:: gt = GeoTessera(registry_path="/path/to/manifest.parquet") **3. Local Directory**:: # Looks for manifest.parquet in the directory (also accepts the legacy # registry.parquet name for backward compat). gt = GeoTessera(registry_dir="/path/to/manifest-dir") **4. Custom URL**:: gt = GeoTessera(registry_url="https://example.com/manifest.parquet") **5. CLI Option**:: geotessera download --cache-dir /custom/cache ... Data Access Layer ----------------- S3 Downloads ~~~~~~~~~~~~ GeoTessera streams tiles directly from the public S3 bucket using anonymous (unsigned) ``botocore`` requests: **Features**: - **Per-output-dir mirroring**: Tiles land in the user-supplied ``--output`` directory and persist there for re-use across runs - **Integrity checking**: End-to-end CRC64NVMe verified against S3's ``x-amz-checksum-crc64nvme`` response header during the body stream - **Conditional caching**: Per-version manifests use ``If-None-Match`` / ``ETag`` sidecars so unchanged manifests yield a 304 with zero body - **Progress callbacks**: Real-time download feedback with speed and size info - **Resumable**: Existing files in the output dir are skipped on rerun **Cache Structure**:: ~/.cache/geotessera/ ├── v1/ │ ├── manifest.parquet # Per-version tile manifest │ ├── manifest.parquet.etag # HTTP ETag for conditional GETs │ ├── landmasks.parquet │ └── landmasks.parquet.etag └── v1.1/ ├── manifest.parquet ├── manifest.parquet.etag ├── landmasks.parquet └── landmasks.parquet.etag # Embedding/landmark tile data lives in the user's --output dir, # not in this cache. The cache holds only per-version manifests. **Download Process**:: import numpy as np from geotessera import dequantize_embedding def fetch_embedding(lon, lat, year): # 1. Fetch the quantized embedding and scales tiles. ``fetch`` returns # a path under embeddings_dir, downloading from S3 if not present. embedding_file = registry.fetch(year=year, lon=lon, lat=lat, is_scales=False) scales_file = registry.fetch(year=year, lon=lon, lat=lat, is_scales=True) # 2. Load and dequantize quantized = np.load(embedding_file) scales = np.load(scales_file) embedding = dequantize_embedding(quantized, scales) # 3. Get CRS from the landmask tile crs, transform = get_utm_projection_from_landmask(lon, lat) return embedding, crs, transform Persistent Tile Storage ~~~~~~~~~~~~~~~~~~~~~~~~ **Why persist tiles?** - Tiles land in the user-supplied ``--output`` (``embeddings_dir``) and are re-used across runs rather than re-downloaded - Existing files are skipped on rerun, making interrupted downloads resumable - Only the small per-version manifests live in ``~/.cache/geotessera``; the bulk embedding data stays under the output directory the user controls **Cache Configuration**:: from geotessera import GeoTessera # Control where registry is cached gt = GeoTessera(cache_dir="/custom/cache") # Default cache locations: # - Linux/macOS: ~/.cache/geotessera/ # - Windows: %LOCALAPPDATA%/geotessera/ GeoTIFF Export Process ~~~~~~~~~~~~~~~~~~~~~~ When exporting to GeoTIFF, additional processing occurs: **Export Workflow**: 1. **Fetch embedding data** (quantized + scales) 2. **Fetch landmask tile** for projection information 3. **Extract native UTM projection** and transform from landmask 4. **Apply dequantization** to embedding data 5. **Preserve original coordinate system** (no reprojection) 6. **Select bands** (if specified) 7. **Write GeoTIFF** with native UTM CRS and accurate transform 8. **Apply compression** (LZW, DEFLATE, etc.) **Projection Inheritance**:: import rasterio def export_geotiff(embedding, landmask_path, output_path, bands=None): # Read projection from landmask with rasterio.open(landmask_path) as landmask: crs = landmask.crs transform = landmask.transform # Select bands if bands: embedding = embedding[:, :, bands] # Write GeoTIFF with rasterio.open(output_path, 'w', driver='GTiff', height=embedding.shape[0], width=embedding.shape[1], count=embedding.shape[2], dtype=embedding.dtype, crs=crs, transform=transform, compress='lzw') as dst: for i in range(embedding.shape[2]): dst.write(embedding[:, :, i], i + 1) Performance Considerations -------------------------- Memory Management ~~~~~~~~~~~~~~~~~ **Large Region Handling**: When processing large regions, GeoTessera uses several strategies: - **Tile-by-tile processing**: Process one tile at a time to limit memory usage - **Band selection**: Only load required bands to reduce memory footprint - **Generator patterns**: Use generators for large tile collections - **Progress callbacks**: Provide feedback for long operations **Example Memory-Efficient Processing**:: def process_large_region(bbox, year, bands=None): """Process a large region without loading all tiles into memory.""" gt = GeoTessera() # Step 1: Get tile list (metadata only, no data loaded) tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=year) # Step 2: Process tiles one at a time using generator for year, tile_lon, tile_lat, embedding, crs, transform in gt.fetch_embeddings(tiles_to_fetch): # Apply band selection early to reduce memory if bands: embedding = embedding[:, :, bands] # Process this tile result = process_single_tile(embedding) # Save or accumulate results save_tile_result(result, tile_lat, tile_lon) # Free memory del embedding Network Optimization ~~~~~~~~~~~~~~~~~~~~ **Sequential Processing**: The fetch_embeddings() generator processes tiles sequentially, which is optimal for most use cases:: # Sequential processing (recommended for most cases) gt = GeoTessera() tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=2024) # Returns generator - tiles are fetched one at a time for year, tile_lon, tile_lat, embedding, crs, transform in gt.fetch_embeddings(tiles_to_fetch): process_tile(embedding) # Memory efficient **Point Sampling**: For sampling at specific locations, use the optimized point sampling method:: # Efficient point sampling with automatic tile download points = [(0.15, 52.05), (0.25, 52.15), (-0.05, 51.55)] embeddings = gt.sample_embeddings_at_points(points, year=2024) # With metadata about which tile each point came from embeddings, metadata = gt.sample_embeddings_at_points( points, year=2024, include_metadata=True ) **Cache Efficiency**: - **Pre-warming**: Download commonly used tiles in advance - **Batch processing**: Group requests by geographic region - **Size limits**: Respect server rate limits Zarr Store (Cloud-Native Access) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``GeoTesseraZarr`` class provides cloud-native access to embeddings without downloading files. It implements the ``geoemb:`` convention for geospatial embedding data stored in Zarr v3 format. **Architecture**: Data is organized by UTM zone, with each zone stored as a separate Zarr group. The store automatically routes geographic queries to the correct zone:: zarr store ├── Root attributes (geoemb:model, geoemb:build_version) ├── utm30/ # UTM Zone 30 │ ├── time[:] # Year coordinate array │ ├── embedding # (time, y, x, band) float32 │ └── ... ├── utm31/ # UTM Zone 31 │ └── ... └── ... **Access patterns**: - **Point sampling**: ``sample_points()`` / ``sample_at()`` for extracting embeddings at specific coordinates across zones - **Region reading**: ``read_region()`` for loading rectangular areas as mosaics with CRS and transform metadata - **Zone access**: ``open_zone()`` returns an xarray Dataset with a ``.tessera`` accessor for direct manipulation Datasets are cached per zone for the lifetime of the ``GeoTesseraZarr`` instance. Future Extensions ~~~~~~~~~~~~~~~~~ The architecture supports future enhancements: - **Temporal queries**: Multi-year analysis - **Cloud optimization**: Direct cloud storage access - **ML integration**: TensorFlow/PyTorch data loaders - **Real-time updates**: Live data ingestion - **Distributed processing**: Dask/Ray integration