Files: AAREADME.txt
Database: TUH EEG Epilepsy Corpus
Version: v3.0.0

-------------------------------------------------------------------------------
Change Log:

 v3.0.0 (20260107): Signal data was annotated; metadata was included.
 v2.0.1 (20240207): Headers were modified. No change to the signal data.

-------------------------------------------------------------------------------
This file contains some basic statistics about the TUH EEG Epilepsy
Corpus, a corpus developed to motivate the development of new methods
for automatic analysis of EEG files using machine learning. This
corpus is a subset of the TUH EEG Corpus and contains sessions from
patients with epilepsy. To balance the corpus, some sessions are
provided from patients that do not have epilepsy.

Subjects were sorted into epilepsy and no epilepsy categories by searching
the associated EEG reports for indications as to an epilepsy/no epilepsy 
diagnosis based on clinical history, medications at the time of recording, 
and EEG features associated with epilepsy such as spike and sharp waves.
A board-certified neurologist, Daniel Goldenholz, and his research team
reviewed and verified the decisions about each patient.

When you use this specific corpus in your research or technology development, 
we ask that you reference the corpus using this publication:

 Veloso, L., McHugh, J. R., von Weltin, E., Obeid, I., & Picone,
 J. (2017). Big Data Resources for EEGs: Enabling Deep Learning
 Research. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE
 Signal Processing in Medicine and Biology Symposium
 (p. 1). Philadelphia, Pennsylvania, USA: IEEE.

This publication can be retrieved from:

 https://www.isip.piconepress.com/publications/conference_presentations/2017/ieee_spmb/data/

Our preferred reference for the TUH EEG Corpus, from which this
seizure corpus was derived, is:

 Obeid, I., & Picone, J. (2016). The Temple University Hospital EEG
 Data Corpus. Frontiers in Neuroscience, Section Neural Technology,
 10, 196.

v3.0.0 of the TUH EEG Epilesy Corpus was based on v2.0.1 of the
TUH EEG Corpus and v2.0.5 of the TUH EEG Corpus. Please see the documentation
for TUH EEG v2.0.1 to understand how the data is structured.

Our annotation guidelines are documented here:

 Melles, A.-M., Paderewski, M., Oymann, R., Shah, V., Obeid, I., &
 Picone, J. (2025). The Natus Medical Incorporated Ambulatory EEG
 Corpus: Annotation Guidelines (p. 16). Temple
 University.

which can be retrieved from here:

 https://isip.piconepress.com/publications/reports/2025/nmae/annotations/

BASIC STATISTICS:

  |--------------------------------------------------------|
  | Description | (00) Epilepsy | (01) No Epilepsy | Total |
  |-------------+---------------+------------------+-------|
  | Patients    |           100 |              100 |   200 |
  |-------------+---------------+------------------+-------|
  | Sessions    |           530 |              168 |   698 |
  |-------------+---------------+------------------+-------|
  | Files       |         2,257 |              564 | 2,821 |
  |--------------------------------------------------------|

The total size of the corpus is 36 Gbytes.

There are several new features of this version of the corpus. First,
the files have been limited to 30 mins. in durations. Second, the data
has been annotated for seizures following the conventions used in
TUSZ. Third, the EEG reports have been analyzed and summarized in a
spreadsheet in /DOCS.

The directory /DOCS contains a few new things. First, there are the
montages that are used to visualize and annotate the data:

 01_tcp_ar_montage.txt
 02_tcp_le_montage.txt
 03_tcp_ar_a_montage.txt
 04_tcp_le_a_montage.txt

Next, there is the metadata spreadsheet that contains information about
each session and subject, such as a diagnosis and medication history:

 metadata_v00r.xlsx

Entries are provided per session. It is not uncommon that there are differences
in some metadata between sessions. EEG reports are inherently noisy. We
report information found in the report for each session, whether or not
thst is consistent with the other sessions.

Finally, there are two lists:

 sessions_common_with_tusz.list
 sessions_unique_to_tuep.list

that sort sessions based on whether they appear in TUSZ.

There are three types of files in this release:

 *.edf:    the EEG sampled data in European Data Format (edf)
 *.csv:    event-based annotations using all available seizure type classes
 *.csv_bi: term-based annotations using only two labels (bckg and seiz)

These are described in more detail in the TUSZ Corpus.

Finally, here are some basic descriptive statistics about the data.
The commands used to generate these numbers are shown below.
For the commands below, the starting point was here:

 nedc_130_[1]: pwd
 /data/isip/data/tuh_eeg_epilepsy/v3.0.0

( 1) Number of files:

 nedc_130_[1]: find 00_* -name "*.edf" | wc -l
 2257
 nedc_130_[1]: find 01_* -name "*.edf" | wc -l
 564
 nedc_130_[1]: find 0?_* -name "*.edf" | wc -l
 2821

( 2) Number of sessions:

 nedc_130_[1]: find 00_* -mindepth 2 -maxdepth 2 | wc -l
 530
 nedc_130_[1]: find 01_* -mindepth 2 -maxdepth 2 | wc -l
 168
 nedc_130_[1]: find 0?_* -mindepth 2 -maxdepth 2 | wc -l
 698

( 3) Number of subjects:

 nedc_130_[1]: find 00_* -mindepth 1 -maxdepth 1 | wc -l
 100
 nedc_130_[1]: find 01_* -mindepth 1 -maxdepth 1 | wc -l
 100
 nedc_130_[1]: find 0?_* -mindepth 1 -maxdepth 1 | wc -l
 200

( 4) Number of files with seizures:

 nedc_130_[1]: find 00_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f5 | cut -d":" -f1 | sort -u | wc -l
 128
 nedc_130_[1]: find 01_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f5 | cut -d":" -f1 | sort -u | wc -l
 1
 nedc_130_[1]: find 0?_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f5 | cut -d":" -f1 | sort -u | wc -l
 129

( 5) Number of sessions with seizures:

 nedc_130_[1]: find 00_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2,3 | sort -u | wc -l
 45
 nedc_130_[1]: find 01_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2,3 | sort -u | wc -l
 0
 nedc_130_[1]: find 0?_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2,3 | sort -u | wc -l
 45

( 6) Number of patients with seizures:

 nedc_130_[1]: find 00_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2 | sort -u | wc -l
 14
 nedc_130_[1]: find 01_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2 | sort -u | wc -l
 0
 nedc_130_[1]: find 0?_* -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2 | sort -u | wc -l
 14

( 7) Total number of seizure events (measured using *.csv_bi):

 nedc_130_[1]: find 00_* -name "*.csv_bi" -exec grep -H seiz {} \; | wc -l
 351
 nedc_130_[1]: find 01_* -name "*.csv_bi" -exec grep -H seiz {} \; | wc -l
 0
 nedc_130_[1]: find 0?_* -name "*.csv_bi" -exec grep -H seiz {} \; | wc -l
 351

( 8) Total duration (in secs):

 nedc_130_[1]: find 00_* -name "*.csv" -exec grep duration {} \; | awk '{ sum+=$4} END {print sum}'
 1909278
 nedc_130_[1]: find 01_* -name "*.csv" -exec grep duration {} \; | awk '{ sum+=$4} END {print sum}'
 365369
 nedc_130_[1]: find 0?_* -name "*.csv" -exec grep duration {} \; | awk '{ sum+=$4} END {print sum}'
 2274647

( 9) Total size of the corpus (00_* + 01_*):

 nedc_130_[1]: cd  /data/isip/data/tuh_eeg_seizure/
 nedc_130_[1]: du -sBM v3.0.0
 36448M	v3.0.0

(10) Total duration of seizure events (in secs):

 nedc_130_[1]: find 00_* -name "*.csv_bi" -exec grep -H "seiz," {} \; | cut -d"," -f2,3 | sed -e "s/,/ /g" | awk '{ sum +=($2-$1)} END {print sum}'
 20728.2
 nedc_130_[1]: find 01_* -name "*.csv_bi" -exec grep -H "seiz," {} \; | cut -d"," -f2,3 | sed -e "s/,/ /g" | awk '{ sum +=($2-$1)} END {print sum}'

 nedc_130_[1]: find 0?_* -name "*.csv_bi" -exec grep -H "seiz," {} \; | cut -d"," -f2,3 | sed -e "s/,/ /g" | awk '{ sum +=($2-$1)} END {print sum}'
 20728.2

-----------------------------

If you have any additional comments or questions about the data,
please direct them to help@nedcdata.org.

Best regards,

Joe Picone