3. csv in the context of pandas and csvkit

3.1. Basics about csv
CSV, short for Comma-Separated Values, is a plain-text format where each row represents a record and columns are separated by commas (,). Example:
name,age,city
Alice,34,Paris
Bob,29,London
Charlie,41,Rome
CSV is a plain-text format (typically ASCII or UTF-8) that makes data human-readable, easy to generate or parse in virtually any programming language (Python, R, SQL, Excel, etc.), and portable across operating systems (Windows, Unix, macOS). Because it is just text with no dependencies, CSV files are simple to share, whether by email, version control systems like GitHub, or quick inspection with tools such as less or cat. They work well for small-to-medium datasets, even with millions of rows, but come with limitations: all values are treated as text until explicitly parsed, and the format lacks support for data types or nested/complex structures.
With Python there are two ways to deal with CSV (a short sketch of each follows):
Built-in support: Python's standard library includes the csv module, which lets you read and write CSV files without installing anything extra. It handles splitting rows into fields, quoting, delimiters, and more.

With pandas: for data analysis, the pandas library makes CSV handling much more powerful. pandas.read_csv("file.csv") loads data directly into a DataFrame, automatically inferring types (numbers, strings, dates) and offering options for missing values, encodings, and delimiters. Saving back is just as easy with .to_csv().
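A minimal sketch contrasting the two approaches; people.csv is a hypothetical file containing the example table from above:

import csv
import pandas as pd

# Standard library: every field comes back as a string
with open("people.csv", newline="") as f:   # hypothetical file with the name/age/city table
    for row in csv.DictReader(f):
        print(row)   # e.g. {'name': 'Alice', 'age': '34', 'city': 'Paris'} -- age is still text

# pandas: types are inferred automatically
df = pd.read_csv("people.csv")
print(df.dtypes)     # age comes back as int64, name and city as object (strings)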
Here we will use the second approach. Later we will also present a handy non-Python tool called csvkit, which bundles command-line utilities for quick inspection and transformation of CSV files, often faster than writing a Python script.
3.2. Pandas DataFrame
A pandas DataFrame (df) is far better suited to tabular data than built-in Python structures like lists, dictionaries, or arrays, for the following reasons:
Tabular structure built-in: a DataFrame behaves like a table or spreadsheet: rows = records, columns = fields. You don't have to manually manage parallel lists or nested dictionaries.
Easy indexing and filtering: doing the same with lists/dicts would require loops and conditionals, which is much more verbose.
Powerful aggregation & grouping: compute counts, averages, sums, or custom statistics without writing loops (sketched just after this list).
Built-in handling of missing data: pandas understands NaN values automatically, and built-in functions handle missing data gracefully.
Integration with plotting and analysis.
Easy I/O: load/save CSV, Excel, SQL, JSON, and more with a single command (see below).
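A minimal sketch of filtering, grouping, and missing-data handling, using a small made-up DataFrame:

import numpy as np
import pandas as pd

# Small made-up DataFrame to illustrate the points above
df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "Rome"],
    "age": [34, 29, np.nan, 41],
})

print(df[df["age"] > 30])                  # filtering: no explicit loop needed
print(df.groupby("city")["age"].mean())    # grouping & aggregation in one line
print(df["age"].isnull().sum())            # missing values (NaN) counted out of the box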
Main methods on df:
Exploring the structure
df.shape # dimensions (rows, columns)
df.columns # list of column names
df.dtypes # data types of each column
df.info() # concise summary
df.head(5) # first 5 rows
df.tail(5) # last 5 rows
Inspecting the data
df.describe() # statistics (mean, std, min, max, quartiles)
df['col'].value_counts() # count frequency of values in a column (a Series)
df['col'].unique() # unique values of a column (a Series)
df.isnull().sum() # count missing values per column
df.sample(5) # random sample of rows
Selecting and filtering
df['col'] # select one column
df[['col1','col2']] # select multiple columns
df.loc[0] # select by label
df.iloc[0] # select by index
df[df['col'] > 10] # filter rows
Sorting and grouping
df.sort_values('col') # sort by one column
df.sort_values(['col1','col2'], ascending=[True, False]) # sort by several columns
df.groupby('col').size() # count rows per group
df.groupby('col')['col2'].mean() # mean of col2 per group
df.groupby('col')['col2'].agg(['mean','count']) # several statistics at once
Modifying
df.rename(columns={'old':'new'}, inplace=True) # rename columns
df.drop(columns=['col'], inplace=True) # drop column
df.dropna() # drop rows with NaN
df.fillna(0) # fill NaN with 0
df.assign(newcol=df['col']*2) # add new column
Exporting
df.to_csv('file.csv', index=False) # save as CSV
df.to_excel('file.xlsx', index=False) # save as Excel
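Putting several of these methods together, a minimal end-to-end sketch (people.csv is again the hypothetical file holding the example table from section 3.1, and people_clean.csv a made-up output name):

import pandas as pd

df = pd.read_csv("people.csv")                  # load
df = df.rename(columns={"city": "location"})    # rename a column
df = df.assign(age_in_months=df["age"] * 12)    # add a derived column
df = df.sort_values("age")                      # sort
df.to_csv("people_clean.csv", index=False)      # export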
3.3. csvkit: the command-line Swiss Army knife
Installation under your favorite conda environment
conda activate python-dsspikes-env
conda install -c conda-forge csvkit # That updates the live environment.
Export the environment back to environment.yml: now that your environment has csvkit, you need to reflect it in your YAML file:
conda env export --from-history > environment.yml
--from-history ensures only the packages you explicitly installed are recorded (cleaner file, avoids tons of build hashes).
Now your environment.yml will include csvkit in its dependencies.
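The exported file might then look roughly like this (a sketch; the exact channels and package list depend on what you actually installed in python-dsspikes-env):

name: python-dsspikes-env
channels:
  - conda-forge
dependencies:
  - python
  - pandas
  - csvkit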
Basic command-line tools
in2csv: Excel Slayer
in2csv converts various tabular data formats, like Excel (.xls, .xlsx), DBF, fixed-width files, or even Google Sheets, into clean, standard CSV.
# Convert Excel to CSV and print to stdout
in2csv data.xlsx
# Convert Excel to CSV and save to a file
in2csv data.xlsx > data.csv
# List sheet names in an Excel file
in2csv -n data.xlsx
# Convert a specific sheet
in2csv --sheet "Sheet1" data.xlsx > data.csv
Convert formats, here CSV to JSON:
in2csv data/records_fake_metadata.csv | csvjson > metadata.json
csvlook: Data Periscope
Quickly inspect your CSV in the terminal: csvlook displays CSV files as a nicely formatted, readable table, almost like a pretty-printed view of your data.
Pipe to less -S to scroll horizontally:
csvlook data/records_fake_metadata.csv | less -S
Preview the first few rows:
csvlook data/records_fake_metadata.csv | head -n 12
csvcut: Data Scalpel
Select columns by name or index:
# By name
csvcut -c cell_id,protocol,prot-opt data/records_fake_metadata.csv
# By index (first column is 1)
csvcut -c 1,4,6 data/records_fake_metadata.csv
csvgrep: Data Filter
Filter rows based on column values:
# Keep only rows where protocol is IC
csvgrep -c protocol -m IC data/records_fake_metadata.csv
# Keep only rows where prot-opt contains "ramp"
csvgrep -c prot-opt -m ramp data/records_fake_metadata.csv
csvsort: Data Organizer
Sort your CSV by one or more columns:
# Sort by date
csvsort -c date data/records_fake_metadata.csv
# Sort by protocol, then by date
csvsort -c protocol,date data/records_fake_metadata.csv
csvstat: Quick Stats
Get basic stats and info about the CSV:
csvstat data/records_fake_metadata.csv
Combining commands
You can pipe all these command-line tools together. For example, get cell_id and file_name for all IC steps:
csvgrep -c protocol -m IC data/records_fake_metadata.csv \
| csvgrep -c prot-opt -m steps \
| csvcut -c cell_id,file_name \
| csvlook
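Instead of ending with csvlook, the same pipeline can be redirected to a file (ic_steps.csv is a made-up output name) and then, for instance, loaded back into pandas with read_csv:

csvgrep -c protocol -m IC data/records_fake_metadata.csv \
  | csvgrep -c prot-opt -m steps \
  | csvcut -c cell_id,file_name \
  > ic_steps.csv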