Spaces:
Running
Running
| id: "duckdb-leakage-detection" | |
| title: "Leakage Detection" | |
| slug: "duckdb-leakage-detection-query" | |
| description: "Detect data leakage between train and test datasets by calculating the overlap percentage." | |
| code: | | |
| WITH | |
| overlapping_rows AS ( | |
| SELECT COUNT(*) AS overlap_count | |
| FROM train | |
| INTERSECT | |
| SELECT COUNT(*) AS overlap_count | |
| FROM test | |
| ), | |
| total_unique_rows AS ( | |
| SELECT COUNT(*) AS total_count | |
| FROM ( | |
| SELECT * FROM train | |
| UNION | |
| SELECT * FROM test | |
| ) combined | |
| ) | |
| SELECT | |
| overlap_count, | |
| total_count, | |
| (overlap_count * 100.0 / total_count) AS overlap_percentage | |
| FROM overlapping_rows, total_unique_rows; | |
| # Dataset Leakage Detection | |
| This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage. | |
| ```sql | |
| WITH | |
| overlapping_rows AS ( | |
| SELECT COUNT(*) AS overlap_count | |
| FROM train | |
| INTERSECT | |
| SELECT COUNT(*) AS overlap_count | |
| FROM test | |
| ), | |
| total_unique_rows AS ( | |
| SELECT COUNT(*) AS total_count | |
| FROM ( | |
| SELECT * FROM train | |
| UNION | |
| SELECT * FROM test | |
| ) combined | |
| ) | |
| SELECT | |
| overlap_count, | |
| total_count, | |
| (overlap_count * 100.0 / total_count) AS overlap_percentage | |
| FROM overlapping_rows, total_unique_rows; | |
| ``` | |
| There is a very good in depth explanation of leakage in public datasets in this [article](https://huggingface.co/blog/lbourdois/lle). | |
| # Example Query | |
| You can check out this [link](https://huggingface.co/datasets/stanfordnlp/imdb?sql_console=true&sql=WITH%0A++++overlapping_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+train%0A++++++++INTERSECT%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+test%0A++++%29%2C%0A++++total_unique_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+total_count%0A++++++++FROM+%28%0A++++++++++++SELECT+*+FROM+train%0A++++++++++++UNION%0A++++++++++++SELECT+*+FROM+test%0A++++++++%29+combined%0A++++%29%0ASELECT%0A++++overlap_count%2C%0A++++total_count%2C%0A++++%28overlap_count+*+100.0+%2F+total_count%29+AS+overlap_percentage%0AFROM+overlapping_rows%2C+total_unique_rows%3B%0A) for the leakage query for the IMDB dataset. | |
|  |