Architecture Guide¶
This guide explains the internal architecture of GeoTessera and how the various components work together to provide efficient access to Tessera embeddings.
Overview¶
GeoTessera is designed around a simple but powerful architecture that optimizes for:
Efficient data access: Only download what you need
Projection preservation: Maintain native UTM projections for accuracy
Scalability: Handle large datasets with lazy loading
Flexibility: Support both analysis and GIS workflows
Reliability: Ensure data integrity with checksums
Core Architecture¶
The library follows a layered architecture:
User Interface Layer
├── CLI Commands (geotessera download, visualize, etc.)
└── Python API (GeoTessera class)
↓
Core Processing Layer
├── GeoTessera class (main interface)
├── Registry (Parquet-based data discovery)
└── Visualization (rendering and web maps)
↓
Data Access Layer
├── Anonymous S3 downloads (botocore)
├── Zarr v3 store (cloud-native streaming)
├── Rasterio (GeoTIFF I/O)
└── GeoPandas (geospatial operations)
↓
Storage Layer
├── Remote servers (https://s3.us-west-2.amazonaws.com/tessera-embeddings)
├── Zarr store (https://s3.us-west-2.amazonaws.com/tessera-embeddings/v1/zarr)
└── Local cache (~/.cache/geotessera/{v1,v1.1}/manifest.parquet)
Coordinate System and Grid¶
Understanding the Tessera Grid¶
The Tessera embeddings are organized on a 0.1-degree grid system:
Grid Properties:
Grid spacing: 0.1° latitude × 0.1° longitude
Tile naming: Named by center coordinates (e.g.,
grid_0.15_52.05)Coverage: Each tile spans from (center - 0.05°) to (center + 0.05°)
Resolution: Approximately 11km × 11km at the equator
Coordinate Calculations:
# For a tile at center coordinates (lon, lat)
west = lon - 0.05
east = lon + 0.05
south = lat - 0.05
north = lat + 0.05
Grid Alignment:
Tile centers are aligned to 0.1-degree boundaries:
# Valid tile centers (examples)
valid_centers = [
(0.05, 52.05), # Northwest Europe
(0.15, 52.05), # Adjacent tile
(-0.05, 51.95), # Southwest tile
]
# Invalid centers (not on grid)
invalid_centers = [
(0.07, 52.03), # Off-grid
(0.1, 52.1), # Off by 0.05°
]
Resolution and Pixel Density¶
The number of pixels per tile varies with latitude due to the Earth’s curvature:
import math
def pixels_per_tile(latitude, resolution_meters=10):
"""Calculate approximate pixels per tile at given latitude."""
# Earth circumference at equator (meters)
earth_circumference = 40075000
# Degrees per meter at equator
degrees_per_meter = 360 / earth_circumference
# Adjust for latitude (longitude only)
lon_degrees_per_meter = degrees_per_meter / math.cos(math.radians(latitude))
lat_degrees_per_meter = degrees_per_meter
# Tile size in meters
tile_width_meters = 0.1 / lon_degrees_per_meter
tile_height_meters = 0.1 / lat_degrees_per_meter
# Pixels in tile
pixels_width = int(tile_width_meters / resolution_meters)
pixels_height = int(tile_height_meters / resolution_meters)
return pixels_width, pixels_height
# Examples
eq_pixels = pixels_per_tile(0) # ~(1111, 1111) at equator
uk_pixels = pixels_per_tile(52) # ~(1823, 1111) in UK
arctic_pixels = pixels_per_tile(80) # ~(6389, 1111) near poles
Data Format and Storage¶
Quantization System¶
Tessera embeddings are stored using a quantization system for efficiency:
Storage Format:
Quantized embeddings (
grid_X.XX_Y.YY.npy):Data type:
int8(values -128 to 127)Shape:
(height, width, 128)Storage efficient: ~1MB per tile vs ~64MB unquantized
Scale factors (
grid_X.XX_Y.YY_scales.npy):Data type:
float32Shape:
(height, width)or(height, width, 128)Contains dequantization multipliers
Dequantization Process:
import numpy as np
# Load quantized data and scales
quantized = np.load("grid_0.15_52.05.npy") # int8
scales = np.load("grid_0.15_52.05_scales.npy") # float32
# Dequantize
if scales.ndim == 2:
# Broadcast 2D scales to 3D
scales = scales[..., np.newaxis]
embedding = quantized.astype(np.float32) * scales
# Result: (height, width, 128) float32 array
This process is handled automatically by GeoTessera.fetch_embedding(), which now returns the dequantized embedding along with CRS and transform information from the corresponding landmask tile.
Metadata and Projections¶
Landmask Files (grid_X.XX_Y.YY.tiff):
Provide native UTM projection information for each tile
Define precise geospatial transforms (no reprojection needed)
Preserve original coordinate system for maximum accuracy
Used for georeferencing when exporting to GeoTIFF
Contain binary land/water masks
Projection Selection:
Each tile uses an appropriate UTM zone based on its location:
def get_utm_zone(longitude):
"""Get UTM zone number for a longitude."""
return int((longitude + 180) / 6) + 1
def get_utm_epsg(longitude, latitude):
"""Get EPSG code for UTM projection."""
zone = get_utm_zone(longitude)
if latitude >= 0:
# Northern hemisphere
return f"EPSG:{32600 + zone}"
else:
# Southern hemisphere
return f"EPSG:{32700 + zone}"
# Example: London at (0.15, 52.05)
epsg = get_utm_epsg(0.15, 52.05) # "EPSG:32631" (UTM Zone 31N)
Registry System¶
Parquet-Based Registry¶
The registry uses one Parquet manifest per dataset version for efficient
data discovery and querying. The manifest is filtered by (version,
variant) at load time, then queried by lat/lon/year:
Manifest Structure (s3://tessera-embeddings/{v1,v1.1}/manifest.parquet):
manifest.parquet (one per dataset version)
├── Columns:
│ ├── version # Normalised dataset version ('1.0', '1.1')
│ ├── variant # Dataset variant ('vultr', 'cambridge', ...)
│ ├── lon, lat # Tile center coordinates
│ ├── year # Data year
│ ├── grid_size # Embedding NPY byte size
│ ├── scales_size # Scales NPY byte size
│ ├── grid_path # Full s3:// URI of the embedding
│ ├── scales_path # Full s3:// URI of the scales
│ ├── grid_mtime # Object mtime on S3
│ └── scales_mtime
└── Rows: One per (version, variant, year, lon, lat) tile
Integrity is not carried in the manifest — it’s verified end-to-end at
download time against S3’s x-amz-checksum-crc64nvme response header.
Querying the Manifest:
import pandas as pd
df = pd.read_parquet("manifest.parquet") # the v1.1 file
# Filter to (version, variant) you want — the manifest can carry both
df = df[(df['version'] == '1.1') & (df['variant'] == 'cambridge')]
# Query tiles in a region
bbox = (-0.2, 51.4, 0.1, 51.6) # (min_lon, min_lat, max_lon, max_lat)
tiles = df[
(df['lon'] >= bbox[0]) & (df['lon'] <= bbox[2]) &
(df['lat'] >= bbox[1]) & (df['lat'] <= bbox[3]) &
(df['year'] == 2024)
]
Manifest Loading Process:
Download per-version manifest (only the one matching
dataset_version)Filter to the requested
(version, variant)Materialise Point geometry from lon/lat (R-tree spatial index)
Set MultiIndex on
(year, lon_i, lat_i)for O(1) lookupsCache the parquet at
~/.cache/geotessera/{v1,v1.1}/manifest.parquetwith an ETag sidecar for conditional GETs
Manifest Sources¶
The manifest can be loaded from multiple sources:
1. Default Remote (recommended):
# Downloads and caches the v1 manifest automatically.
from geotessera import GeoTessera
gt = GeoTessera()
# Cached at: ~/.cache/geotessera/v1/manifest.parquet
# Pick a different (version, variant) — see :ref:`dataset-versions`
gt = GeoTessera(dataset_version="v1.1", dataset_variant="cambridge")
2. Local File:
gt = GeoTessera(registry_path="/path/to/manifest.parquet")
3. Local Directory:
# Looks for manifest.parquet in the directory (also accepts the legacy
# registry.parquet name for backward compat).
gt = GeoTessera(registry_dir="/path/to/manifest-dir")
4. Custom URL:
gt = GeoTessera(registry_url="https://example.com/manifest.parquet")
5. CLI Option:
geotessera download --cache-dir /custom/cache ...
Data Access Layer¶
S3 Downloads¶
GeoTessera streams tiles directly from the public S3 bucket using anonymous
(unsigned) botocore requests:
Features:
Per-output-dir mirroring: Tiles land in the user-supplied
--outputdirectory and persist there for re-use across runsIntegrity checking: End-to-end CRC64NVMe verified against S3’s
x-amz-checksum-crc64nvmeresponse header during the body streamConditional caching: Per-version manifests use
If-None-Match/ETagsidecars so unchanged manifests yield a 304 with zero bodyProgress callbacks: Real-time download feedback with speed and size info
Resumable: Existing files in the output dir are skipped on rerun
Cache Structure:
~/.cache/geotessera/
├── v1/
│ ├── manifest.parquet # Per-version tile manifest
│ ├── manifest.parquet.etag # HTTP ETag for conditional GETs
│ ├── landmasks.parquet
│ └── landmasks.parquet.etag
└── v1.1/
├── manifest.parquet
├── manifest.parquet.etag
├── landmasks.parquet
└── landmasks.parquet.etag
# Embedding/landmark tile data lives in the user's --output dir,
# not in this cache. The cache holds only per-version manifests.
Download Process:
import numpy as np
from geotessera import dequantize_embedding
def fetch_embedding(lon, lat, year):
# 1. Fetch the quantized embedding and scales tiles. ``fetch`` returns
# a path under embeddings_dir, downloading from S3 if not present.
embedding_file = registry.fetch(year=year, lon=lon, lat=lat, is_scales=False)
scales_file = registry.fetch(year=year, lon=lon, lat=lat, is_scales=True)
# 2. Load and dequantize
quantized = np.load(embedding_file)
scales = np.load(scales_file)
embedding = dequantize_embedding(quantized, scales)
# 3. Get CRS from the landmask tile
crs, transform = get_utm_projection_from_landmask(lon, lat)
return embedding, crs, transform
Persistent Tile Storage¶
Why persist tiles?
Tiles land in the user-supplied
--output(embeddings_dir) and are re-used across runs rather than re-downloadedExisting files are skipped on rerun, making interrupted downloads resumable
Only the small per-version manifests live in
~/.cache/geotessera; the bulk embedding data stays under the output directory the user controls
Cache Configuration:
from geotessera import GeoTessera
# Control where registry is cached
gt = GeoTessera(cache_dir="/custom/cache")
# Default cache locations:
# - Linux/macOS: ~/.cache/geotessera/
# - Windows: %LOCALAPPDATA%/geotessera/
GeoTIFF Export Process¶
When exporting to GeoTIFF, additional processing occurs:
Export Workflow:
Fetch embedding data (quantized + scales)
Fetch landmask tile for projection information
Extract native UTM projection and transform from landmask
Apply dequantization to embedding data
Preserve original coordinate system (no reprojection)
Select bands (if specified)
Write GeoTIFF with native UTM CRS and accurate transform
Apply compression (LZW, DEFLATE, etc.)
Projection Inheritance:
import rasterio
def export_geotiff(embedding, landmask_path, output_path, bands=None):
# Read projection from landmask
with rasterio.open(landmask_path) as landmask:
crs = landmask.crs
transform = landmask.transform
# Select bands
if bands:
embedding = embedding[:, :, bands]
# Write GeoTIFF
with rasterio.open(output_path, 'w',
driver='GTiff',
height=embedding.shape[0],
width=embedding.shape[1],
count=embedding.shape[2],
dtype=embedding.dtype,
crs=crs,
transform=transform,
compress='lzw') as dst:
for i in range(embedding.shape[2]):
dst.write(embedding[:, :, i], i + 1)
Performance Considerations¶
Memory Management¶
Large Region Handling:
When processing large regions, GeoTessera uses several strategies:
Tile-by-tile processing: Process one tile at a time to limit memory usage
Band selection: Only load required bands to reduce memory footprint
Generator patterns: Use generators for large tile collections
Progress callbacks: Provide feedback for long operations
Example Memory-Efficient Processing:
def process_large_region(bbox, year, bands=None):
"""Process a large region without loading all tiles into memory."""
gt = GeoTessera()
# Step 1: Get tile list (metadata only, no data loaded)
tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=year)
# Step 2: Process tiles one at a time using generator
for year, tile_lon, tile_lat, embedding, crs, transform in gt.fetch_embeddings(tiles_to_fetch):
# Apply band selection early to reduce memory
if bands:
embedding = embedding[:, :, bands]
# Process this tile
result = process_single_tile(embedding)
# Save or accumulate results
save_tile_result(result, tile_lat, tile_lon)
# Free memory
del embedding
Network Optimization¶
Sequential Processing:
The fetch_embeddings() generator processes tiles sequentially, which is optimal for most use cases:
# Sequential processing (recommended for most cases)
gt = GeoTessera()
tiles_to_fetch = gt.registry.load_blocks_for_region(bounds=bbox, year=2024)
# Returns generator - tiles are fetched one at a time
for year, tile_lon, tile_lat, embedding, crs, transform in gt.fetch_embeddings(tiles_to_fetch):
process_tile(embedding) # Memory efficient
Point Sampling:
For sampling at specific locations, use the optimized point sampling method:
# Efficient point sampling with automatic tile download
points = [(0.15, 52.05), (0.25, 52.15), (-0.05, 51.55)]
embeddings = gt.sample_embeddings_at_points(points, year=2024)
# With metadata about which tile each point came from
embeddings, metadata = gt.sample_embeddings_at_points(
points, year=2024, include_metadata=True
)
Cache Efficiency:
Pre-warming: Download commonly used tiles in advance
Batch processing: Group requests by geographic region
Size limits: Respect server rate limits
Zarr Store (Cloud-Native Access)¶
The GeoTesseraZarr class provides cloud-native access to embeddings
without downloading files. It implements the geoemb: convention for
geospatial embedding data stored in Zarr v3 format.
Architecture:
Data is organized by UTM zone, with each zone stored as a separate Zarr group. The store automatically routes geographic queries to the correct zone:
zarr store
├── Root attributes (geoemb:model, geoemb:build_version)
├── utm30/ # UTM Zone 30
│ ├── time[:] # Year coordinate array
│ ├── embedding # (time, y, x, band) float32
│ └── ...
├── utm31/ # UTM Zone 31
│ └── ...
└── ...
Access patterns:
Point sampling:
sample_points()/sample_at()for extracting embeddings at specific coordinates across zonesRegion reading:
read_region()for loading rectangular areas as mosaics with CRS and transform metadataZone access:
open_zone()returns an xarray Dataset with a.tesseraaccessor for direct manipulation
Datasets are cached per zone for the lifetime of the GeoTesseraZarr
instance.
Future Extensions¶
The architecture supports future enhancements:
Temporal queries: Multi-year analysis
Cloud optimization: Direct cloud storage access
ML integration: TensorFlow/PyTorch data loaders
Real-time updates: Live data ingestion
Distributed processing: Dask/Ray integration