# Eye-Tracking TSV Normalization: Dots vs Empty Strings

## Decision: Use Empty Strings (Not NaNs or Dots)

### Why Not NaNs?
- **NaN** is a Python-specific numeric concept (`float('nan')`)
- In TSV/CSV files (plain text), NaN cannot be represented directly
- When data is written to text, it becomes the string `"nan"` (4 characters)
- This violates BIDS/TSV standards for missing data

### Why Not Dots?
- While SR Research EyeLink exports use **`.`** to indicate missing values
- BIDS standard prefers **empty cells** (no value between tabs) for missing data
- Dots could be ambiguous - are they missing values or the string "."?
- Empty strings are more portable and universally understood

### What We Do Instead: Empty Strings

```
BEFORE (EyeLink format):
AVERAGE_ACCELERATION_X  AVERAGE_GAZE_X
.                       963.20
-497.78                 965.30

AFTER (PRISM/BIDS format):
AVERAGE_ACCELERATION_X  x
                        963.20
-497.78                 965.30
```

The column between the tabs is **completely empty** - no dot, no "nan", no "NA".

---

## BIDS/TSV Standard for Missing Values

According to [BIDS specification on TSV files](https://bids.neuroimaging.io/getting_started/folders_and_files/metadata/tsv.html):

> **Missing values**: Missing values SHOULD be left empty and not represented as a string.

### Valid representations for missing values in TSV:
- ✅ Empty cell (nothing between tabs)
- ✅ Column not present (column dropped entirely)

### Invalid representations:
- ❌ `"."` (dot)
- ❌ `"NA"` (string)
- ❌ `"NaN"` (string representation of NaN)
- ❌ `"null"` (JSON-style)

---

## How to Handle Missing Values When Reading

When your analysis code **reads** these TSV files:

### Python (pandas)
```python
import pandas as pd

df = pd.read_csv('sub-17_ses-1_task-gaze_eyetrack.tsv', sep='\t')

# Empty strings are automatically treated as missing:
# To explicitly convert to NaN:
df = df.replace('', pd.NA)

# To work with them:
print(df['x'].isna())  # Shows True for empty cells
print(df['x'].dropna())  # Removes rows with missing x
```

### R
```r
df <- read.delim('sub-17_ses-1_task-gaze_eyetrack.tsv')

# Empty strings are automatically NA in R
df$x[is.na(df$x)]  # Find missing values
```

### NumPy/SciPy
```python
import numpy as np

data = np.genfromtxt('sub-17_ses-1_task-gaze_eyetrack.tsv', 
                     delimiter='\t', 
                     dtype=None, 
                     encoding='utf-8',
                     missing_values='',  # Treat empty as missing
                     filling_values=np.nan)  # Replace with NaN
```

---

## Summary of Changes

The updated `_process_eyetracking_tsv()` function now:

1. **Drops** `RECORDING_SESSION_LABEL` column
   - Redundant: filename already encodes `sub-17_ses-1`
   - Saves ~30 bytes per row × millions of rows

2. **Converts dots to empty strings**
   - Complies with BIDS/TSV standard
   - Makes data more portable
   - Keeps analysis tools happy

3. **Renames columns to BIDS-style**
   - `AVERAGE_GAZE_X` → `x`
   - `AVERAGE_GAZE_Y` → `y`
   - `AVERAGE_PUPIL_SIZE` → `pupil_size`
   - `TIMESTAMP` → `timestamp`

4. **Preserves all other columns**
   - Kinematic data (accelerations, velocities)
   - Blink/saccade flags
   - Trial indices
   - Metadata (SAMPLE_MESSAGE)

---

## Example: Before and After

### BEFORE (Raw EyeLink Export)
```text
RECORDING_SESSION_LABEL	TRIAL_INDEX	AVERAGE_ACCELERATION_X	AVERAGE_GAZE_X	TIMESTAMP
s17_nr_1	1	.	963.20	5529512.00
s17_nr_1	1	-497.78	965.30	5529521.00
```

### AFTER (PRISM Normalized)
```text
TRIAL_INDEX	AVERAGE_ACCELERATION_X	x	timestamp
1		963.20	5529512.00
1	-497.78	965.30	5529521.00
```

---

## Configuration in JSON Sidecar

The JSON sidecar documents this normalization:

```json
{
  "Technical": {
    "FileFormat": "tsv",
    "ProcessingLevel": "parsed",
    "NormalizationApplied": {
      "DroppedColumns": ["RECORDING_SESSION_LABEL"],
      "RenamedColumns": {
        "AVERAGE_GAZE_X": "x",
        "AVERAGE_GAZE_Y": "y",
        "AVERAGE_PUPIL_SIZE": "pupil_size",
        "TIMESTAMP": "timestamp"
      },
      "MissingValueNormalization": {
        "From": "dots (.)",
        "To": "empty strings",
        "Standard": "BIDS-compatible"
      }
    }
  }
}
```

---

## References

- [BIDS Specification - TSV Format](https://bids.neuroimaging.io/getting_started/folders_and_files/metadata/tsv.html)
- [BEP 020 - Eye Tracking](https://bids.neuroimaging.io/extensions/beps/bep_020.html)
- [Python/pandas handling of missing values](https://pandas.pydata.org/docs/user_guide/missing_data.html)