File: AAREADME.txt
Database: TUH EEG Seizure Corpus (TUSZ)
Version: 2.0.3
-------------------------------------------------------------------------------
Change Log:

 v2.0.3 (20250618): Fixed annotation issues with aaaaajdn_s003_t003 and
  aaaaalmx_s002 (split a long file into parts).

 v2.0.3 (20250401): Corrected a few corrupted files to match TUEG.

 v2.0.3 (20240207): Headers were modified. No change to the signal data.

 v2.0.2 (20240113): Removed duplicate montages for two sessions in /eval:

   eval/aaaaaqvx/s003_2015_08_24/
   eval/aaaaaqvx/s010_2015_08_27/

  03_tcp_ar_a was retained and 01_tcp_ar was deleted.

  Added a seizure event for:

   dev/aaaaadkj/s002_2007_10_22/02_tcp_le

 v2.0.1 (20231004): A few problems with the start and end times of seizure
  events were corrected, including boundaries that exceeded the end of the
  file or overlapped on the same channel. Most of these were related to
  issues with the annotation tool. Several short gaps between two adjacent
  seizure events were removed. There are 35 files that changed. These are
  listed at the bottom of this file.
-------------------------------------------------------------------------------
This file contains some basic statistics about the TUH EEG Seizure Corpus,
a corpus developed to motivate the development of high-performance seizure
detection algorithms using machine learning. This corpus is a subset of the
TUH EEG Corpus and contains sessions that are known to contain seizure
events. To balance the corpus, some sessions are provided that do not
contain seizure events, so that the false alarm performance of a system can
be tested.

When you use this specific corpus in your research or technology
development, we ask that you reference the corpus using this publication:

 Shah, V., von Weltin, E., Lopez, S., McHugh, J., Veloso, L.,
 Golmohammadi, M., Obeid, I., & Picone, J. (2018). The Temple University
 Hospital Seizure Detection Corpus. Frontiers in Neuroinformatics, 12:83.
 doi: 10.3389/fninf.2018.00083

This publication can be retrieved from:

 https://www.isip.piconepress.com/publications/journals_refereed/2018/frontiers_neuroscience/tuh_eeg_seizure

Our preferred reference for the TUH EEG Corpus, from which this seizure
corpus was derived, is:

 Obeid, I., & Picone, J. (2018). The Temple University Hospital EEG Data
 Corpus. In Augmentation of Brain Function: Facts, Fiction and Controversy.
 Volume I: Brain-Machine Interfaces (1st ed., pp. 394–398). Lausanne,
 Switzerland: Frontiers Media S.A.

The data in this release was based on v2.0.3 of the TUH EEG Corpus.

There are three main directories in this release: train, dev and eval. The
train directory contains data you are allowed to use for the development of
your technology. The dev data is disjoint from the training set and should
only be used for testing. Eval is a blind evaluation set - you should never
optimize parameters on this set.

The top-level directories are edf/dev, edf/eval and edf/train. Please see
the documentation for TUH EEG v2.0.3 to understand how the data is
structured.

There are three types of files in this release (older formats have been
obsoleted):

 *.edf:    the EEG sampled data in European Data Format (edf)
 *.csv:    event-based annotations using all available seizure type classes
 *.csv_bi: term-based annotations using only two labels (bckg and seiz)

Event-based annotations are per-channel. This means each annotation
contains, in addition to a start and stop time, a channel index. Seizures
can often be observed first on one or more channels and then spread to
other channels. Event-based annotations capture this.

Term-based annotations use one label that applies to all channels. These
are most useful for machine learning research in which we tend to worry
only about the overall classification of a segment and are not concerned
about individual channels.

Bi-class annotations use two labels: seizure (seiz) and background (bckg).
The multi-class annotations use all available seizure types. These are
described in the spreadsheet:

 $TUSZ/v2.0.3/DOCS/seizures_types_v02.xlsx
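Since the annotation files are plain comma-separated text, they can be
parsed with a few lines of code. The sketch below (Python) is not part of
the distribution; it is a minimal example that assumes the column layout
channel,start_time,stop_time,label,confidence and a "# duration = ... secs"
metadata line, consistent with the grep/awk commands in the statistics
section below. The function name load_annotations and the seizure label
test are illustrative choices, not corpus tools.

# Minimal sketch: summarize the seizure content of one *.csv or *.csv_bi
# file. The five-column layout and the "# duration" line are assumptions
# based on recent TUSZ releases; adjust if your copy differs.
import csv
import sys

def load_annotations(path):
    """Return (duration_secs, events); events are (channel, start, stop, label)."""
    duration = None
    events = []
    with open(path, "r") as fp:
        for row in csv.reader(fp):
            if not row or not row[0]:
                continue
            if row[0].startswith("# duration"):
                # metadata line looks like "# duration = 1750.00 secs"
                duration = float(row[0].split()[3])
                continue
            if row[0].startswith("#"):
                continue                     # other metadata lines
            if row[0] == "channel" or len(row) != 5:
                continue                     # column header or unexpected row
            events.append((row[0], float(row[1]), float(row[2]), row[3]))
    return duration, events

if __name__ == "__main__":
    duration, events = load_annotations(sys.argv[1])
    # "seiz" covers *.csv_bi; the multi-class labels in *.csv end in "sz"
    seiz = [e for e in events if e[3] == "seiz" or e[3].endswith("sz")]
    seiz_secs = sum(stop - start for _, start, stop, _ in seiz)
    print(f"duration = {duration} secs, seizure events = {len(seiz)}, "
          f"seizure time = {seiz_secs:.1f} secs")

For a *.csv_bi file this reports, per file, the same quantities that items
(7) and (10) in the statistics section below tally across each set.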
Clinical EEGs use a variety of channel configurations. In the larger TUH
EEG Corpus, there are over 40 different channel configurations. In this
subset, there are two types of EEGs: averaged reference (AR) and linked
ears reference (LE). Fortunately, all files in this subset contain the
standard channels you would expect from a 10/20 configuration, and all
files can be converted to a TCP montage (which is what we use internally
for our processing). To learn more about this, please consult the
following publication:

 Lopez, S., Gross, A., Yang, S., Golmohammadi, M., Obeid, I., & Picone, J.
 (2016). An Analysis of Two Common Reference Points for EEGs. In IEEE
 Signal Processing in Medicine and Biology Symposium (pp. 1–4).
 Philadelphia, Pennsylvania, USA. Available at:
 https://www.isip.piconepress.com/publications/conference_proceedings/2016/ieee_spmb/montages/

The channel number in the csv files refers to the channels defined using a
standard ACNS TCP montage. This is our preferred way of viewing seizure
data. The montage is defined as follows:

 montage =  0, FP1-F7: EEG FP1-REF -- EEG F7-REF
 montage =  1, F7-T3:  EEG F7-REF  -- EEG T3-REF
 montage =  2, T3-T5:  EEG T3-REF  -- EEG T5-REF
 montage =  3, T5-O1:  EEG T5-REF  -- EEG O1-REF
 montage =  4, FP2-F8: EEG FP2-REF -- EEG F8-REF
 montage =  5, F8-T4:  EEG F8-REF  -- EEG T4-REF
 montage =  6, T4-T6:  EEG T4-REF  -- EEG T6-REF
 montage =  7, T6-O2:  EEG T6-REF  -- EEG O2-REF
 montage =  8, A1-T3:  EEG A1-REF  -- EEG T3-REF
 montage =  9, T3-C3:  EEG T3-REF  -- EEG C3-REF
 montage = 10, C3-CZ:  EEG C3-REF  -- EEG CZ-REF
 montage = 11, CZ-C4:  EEG CZ-REF  -- EEG C4-REF
 montage = 12, C4-T4:  EEG C4-REF  -- EEG T4-REF
 montage = 13, T4-A2:  EEG T4-REF  -- EEG A2-REF
 montage = 14, FP1-F3: EEG FP1-REF -- EEG F3-REF
 montage = 15, F3-C3:  EEG F3-REF  -- EEG C3-REF
 montage = 16, C3-P3:  EEG C3-REF  -- EEG P3-REF
 montage = 17, P3-O1:  EEG P3-REF  -- EEG O1-REF
 montage = 18, FP2-F4: EEG FP2-REF -- EEG F4-REF
 montage = 19, F4-C4:  EEG F4-REF  -- EEG C4-REF
 montage = 20, C4-P4:  EEG C4-REF  -- EEG P4-REF
 montage = 21, P4-O2:  EEG P4-REF  -- EEG O2-REF

For example, channel 1 is the difference between electrodes F7 and T3. It
represents the arithmetic difference (F7-REF) - (T3-REF) of two channels
contained in the EDF file.

For files in the 02_tcp_le montage, the channels are named with an LE
suffix (e.g., EEG P4-LE); all channel derivations are the same. For files
in the 03_tcp_ar_a montage, the derivations that use EEG A1-REF and
EEG A2-REF are not included.
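To make the montage arithmetic concrete, the short Python sketch below
derives the F7-T3 channel (montage = 1) from the referential signals stored
in an EDF file. It is not part of the corpus tools; it assumes the
MNE-Python package for reading EDF files, a file from the 01_tcp_ar montage
(for 02_tcp_le files, replace the "-REF" suffix with "-LE"), and a
placeholder file name.

# Minimal sketch, assuming MNE-Python (https://mne.tools) is installed.
import mne

def tcp_channel(raw, anode, cathode, suffix="-REF"):
    """Return the derived signal anode - cathode, e.g. (F7-REF) - (T3-REF)."""
    a = raw.get_data(picks=[f"EEG {anode}{suffix}"])[0]
    c = raw.get_data(picks=[f"EEG {cathode}{suffix}"])[0]
    return a - c   # MNE returns EEG data in volts

if __name__ == "__main__":
    # The file name is a placeholder; point this at any *.edf in the corpus.
    raw = mne.io.read_raw_edf("some_file.edf", preload=True, verbose="error")
    f7_t3 = tcp_channel(raw, "F7", "T3")   # montage = 1 in the table above
    print(f"F7-T3: {f7_t3.shape[0]} samples at {raw.info['sfreq']} Hz")

The same helper reproduces any row of the table above by substituting the
corresponding electrode pair (e.g., tcp_channel(raw, "FP1", "F7") for
montage = 0).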
Finally, here are some basic descriptive statistics about the data. The
commands used to generate these numbers are shown below (/dev is used as
an example). For the commands below, the starting point was here:

 /data/isip/data/tuh_eeg_seizure/v2.0.3/edf

 ( 1) Number of files:

      nedc_130_[1]: find . -name "*.edf" | wc
      7364 7364 439929
      nedc_130_[1]: find ./train -name "*.edf" | wc
      4667 4667 282030
      nedc_130_[1]: find ./dev -name "*.edf" | wc
      1832 1832 106746
      nedc_130_[1]: find ./eval -name "*.edf" | wc
      865 865 51153
      nedc_130_[1]: find . -name "*.csv" | wc
      7364 7364 439929
      nedc_130_[1]: find . -name "*.csv_bi" | wc
      7364 7364 462021

 ( 2) Number of sessions:

      nedc_130_[1]: find * -mindepth 3 -maxdepth 3 | wc
      1643 1643 57117
      nedc_130_[1]: find train -mindepth 3 -maxdepth 3 | wc
      1175 1175 41461
      nedc_130_[1]: find dev -mindepth 3 -maxdepth 3 | wc
      342 342 11358
      nedc_130_[1]: find eval -mindepth 3 -maxdepth 3 | wc
      126 126 4298

 ( 3) Number of patients:

      nedc_130_[1]: find train -mindepth 1 -maxdepth 1 | wc
      579 579 8685
      nedc_130_[1]: find dev -mindepth 1 -maxdepth 1 | wc
      53 53 689
      nedc_130_[1]: find eval -mindepth 1 -maxdepth 1 | wc
      43 43 602

 ( 4) Number of files with seizures:

      nedc_130_[1]: find train -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f5 | cut -d":" -f1 | sort -u | wc
      873 873 20079
      nedc_130_[1]: find dev -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f5 | cut -d":" -f1 | sort -u | wc
      325 325 7475
      nedc_130_[1]: find eval -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f5 | cut -d":" -f1 | sort -u | wc
      195 195 4485

 ( 5) Number of sessions with seizures:

      nedc_130_[1]: find train -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2,3 | sort -u | wc
      352 352 6688
      nedc_130_[1]: find dev -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2,3 | sort -u | wc
      114 114 2166
      nedc_130_[1]: find eval -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2,3 | sort -u | wc
      63 63 1197

 ( 6) Number of patients with seizures:

      nedc_130_[1]: find train -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2 | sort -u | wc
      208 208 1872
      nedc_130_[1]: find dev -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2 | sort -u | wc
      45 45 405
      nedc_130_[1]: find eval -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2 | sort -u | wc
      34 34 306

 ( 7) Total number of seizure events (measured using *.csv_bi):

      nedc_130_[1]: find train -name "*.csv_bi" -exec grep -H seiz {} \; | wc
      2420 2420 233300
      nedc_130_[1]: find dev -name "*.csv_bi" -exec grep -H seiz {} \; | wc
      1075 1075 101655
      nedc_130_[1]: find eval -name "*.csv_bi" -exec grep -H seiz {} \; | wc
      469 469 44540

 ( 8) Total duration in seconds:

      nedc_130_[1]: find train -name "*.csv" -exec grep duration {} \; | awk '{ sum+=$4} END {print sum}'
      3277229
      nedc_130_[1]: find dev -name "*.csv" -exec grep duration {} \; | awk '{ sum+=$4} END {print sum}'
      1567972
      nedc_130_[1]: find eval -name "*.csv" -exec grep duration {} \; | awk '{ sum+=$4} END {print sum}'
      459713

 ( 9) Total size of the corpus (/train + /dev + /eval): 81,492 Mbytes (81.4 Gbytes)

      nedc_130_[1]: cd /data/isip/data/tuh_eeg_seizure/
      nedc_130_[1]: du -sBM v2.0.3
      81537M v2.0.3

 (10) Total duration of seizure events in seconds:

      nedc_130_[1]: find train -name "*.csv_bi" -exec grep -H "seiz," {} \; | cut -d"," -f2,3 | sed -e "s/,/ /g" | awk '{ sum +=($2-$1)} END {print sum}'
      171394
      nedc_130_[1]: find dev -name "*.csv_bi" -exec grep -H "seiz," {} \; | cut -d"," -f2,3 | sed -e "s/,/ /g" | awk '{ sum +=($2-$1)} END {print sum}'
      71310.6
      nedc_130_[1]: find eval -name "*.csv_bi" -exec grep -H "seiz," {} \; | cut -d"," -f2,3 | sed -e "s/,/ /g" | awk '{ sum +=($2-$1)} END {print sum}'
      27246.7

-----------------------------

If you have any additional comments or questions about the data, please
direct them to help@nedcdata.org.

Best regards,

Joe Picone