Things learnt from recent preparation of 1T data for testing
Several things learnt from the recent preparation of data for testing.
Question the approach when it takes too long to process data
sed: remove last n lines
In the past I always thought sed was the fastest way to manipulate data. Is that true in all cases?
I generated 1 TB of testing data in ndjson format, but the last line of those files was corrupted JSON data, so I had to remove the last line. The tail command can show that line quickly, while sed takes much longer to delete it.
Given how tail works, there must be a more efficient way to remove the last line or the last n lines. “Efficiently remove the last two lines of an extremely large text file” at superuser.com gives a Python program and the truncate command; truncate is my choice because it has fewer dependencies. It takes less than 1 second to remove the last line from a 100 GB file instead of minutes. So sed is not the fastest way to delete the last line or the last n lines of a file. Here is the command I used to remove the last line.
truncate --size=-$(tail -n1 myfile | wc -m) myfile
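The same idea extends to the last n lines; a minimal sketch with n=2 and the same hypothetical myfile, assuming GNU coreutils and a file that ends with a newline (wc -c counts bytes, which is what truncate expects):
# measure how many bytes the last n lines occupy, then shrink the file by that amount
n=2
truncate --size=-$(tail -n "$n" myfile | wc -c) myfile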
sed: remove some lines in the middle of a file
In theory, it should be possible to seek to a specific line and manipulate that block in place, linewise, without rewriting the whole file; I don’t know how to do that yet.
“You can actually edit a file without re-writing the whole file” at reddit shows one way: overwrite the unwanted lines with whitespace in place, provided that has no side effects on the downstream processing.
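A minimal sketch of that in-place blanking, assuming GNU coreutils; data.ndjson and the target line number are hypothetical, and the downstream consumer must tolerate a line of spaces:
line=3                                                   # line to blank out (hypothetical)
offset=$(head -n $((line - 1)) data.ndjson | wc -c)      # byte offset where the target line starts
len=$(( $(sed -n "${line}p" data.ndjson | wc -c) - 1 ))  # length of that line, excluding its newline
# overwrite the line with spaces at that offset, without truncating the rest of the file
printf '%*s' "$len" '' | dd of=data.ndjson bs=1 seek="$offset" conv=notrunc status=none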
find and execute
While processing the data, I noticed that both CPU and I/O usage were low. Given a multi-core system with SSD storage, there is room for improvement, and parallelising the execution is a natural choice.
find . -type f -name '*.zip' -print0 |\
parallel -0 unzip -d {/.} {}
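With GNU parallel, {} expands to each zip path and {/.} to its basename without the extension, so every archive is extracted into a directory named after it; the number of concurrent jobs can be tuned with -j if the defaults do not saturate the machine.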
Simple JSON validation (not schema or structure level)
PostgreSQL is slow at processing JSON data into jsonb because it does more work than simple validation. A better option is to validate the data before loading it into PostgreSQL.
jq -c '.' < test.ndjson > /dev/null
jq 'empty' < test.ndjson
# TODO: show error lines and continue processing the remaining lines
Validate JSON and get the lines with errors
> jq -R -n 'inputs | try (fromjson|empty) catch input_line_number' < test.ndjson
2
4
10
> sed -i '2d;4d;10d' test.ndjson
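Putting the two steps together, a small sketch on the same test.ndjson, assuming GNU sed, that builds the delete expression from jq’s output instead of typing the line numbers by hand:
# collect the line numbers of invalid JSON, turn them into "2d;4d;10d", then delete in place
bad=$(jq -R -n 'inputs | try (fromjson|empty) catch input_line_number' < test.ndjson)
if [ -n "$bad" ]; then
    sed -i "$(printf '%s\n' "$bad" | sed 's/$/d/' | paste -sd ';' -)" test.ndjson
fi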