3. CSV in the context of pandas and csvkit#

3.1. Basics about CSV#

CSV, short for Comma-Separated Values, is a plain-text format in which each row represents a record and columns are separated by commas (,). Example:

name,age,city
Alice,34,Paris
Bob,29,London
Charlie,41,Rome

Because CSV is just plain text with no dependencies, the data stays human-readable, is easy to generate or parse in virtually any language or tool (Python, R, SQL, Excel, etc.), and is portable across operating systems (Windows, Unix, macOS). CSV files are simple to share, whether by email, version control systems like GitHub, or quick inspection with tools such as less or cat. They work well for small-to-medium datasets, even with millions of rows, but come with limitations: all values are treated as text until explicitly parsed, and the format has no support for data types or nested/complex structures.

With Python there are two main ways to deal with CSV (a minimal sketch of each follows this list):

  • Built-in support: Python’s standard library includes the csv module, which lets you read and write CSV files without installing anything extra. It handles splitting rows into fields, quoting, delimiters, and more.

  • With pandas: For data analysis, the pandas library makes CSV handling much more powerful. pandas.read_csv("file.csv") loads data directly into a DataFrame, automatically inferring types (numbers, strings, dates) and offering options for missing values, encodings, and delimiters. Saving back is just as easy with .to_csv().
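
A minimal sketch of each approach, assuming the small data.csv file shown above (the file name is illustrative). First the standard-library csv module:

import csv

# Read data.csv with the built-in csv module.
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Every value arrives as a string; types must be converted by hand.
        print(row["name"], int(row["age"]), row["city"])

Then the pandas equivalent:

import pandas as pd

# read_csv infers column types: age becomes an integer column automatically.
df = pd.read_csv("data.csv")
print(df.dtypes)

# Writing back is symmetric; index=False omits the row-index column.
df.to_csv("data_out.csv", index=False)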

Here we will use the second solution. Later we will also present a handy non-Python tool called csvkit, which bundles command-line utilities for quick inspection and transformation of CSV files, often faster than writing a Python script.

3.2. Pandas DataFrame#

A pandas DataFrame (df) is far better suited to tabular data than built-in Python structures like lists, dictionaries, or arrays, for the following reasons:

  • Tabular structure built-in

    • A DataFrame behaves like a table or spreadsheet: rows = records, columns = fields.

    • You don’t have to manually manage parallel lists or nested dictionaries.

  • Easy indexing and filtering: doing the same with lists/dicts would require loops and conditionals, which is much more verbose.

  • Powerful aggregation & grouping: compute counts, averages, sums, or custom statistics without writing loops

  • Built-in handling of missing data

    • Pandas understands NaN values automatically.

    • Built-in functions handle missing data gracefully.

  • Integration with plotting and analysis

  • Easy I/O: load/save CSV, Excel, SQL, JSON, and more with a single command.

The sketch below and the methods listed in the rest of this section illustrate these points.
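
A minimal sketch using the small table from the example above; with plain lists or dicts, each step would need an explicit loop:

import pandas as pd

# Build a tiny DataFrame from the sample data shown earlier.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [34, 29, 41],
    "city": ["Paris", "London", "Rome"],
})

over_30 = df[df["age"] > 30]               # vectorized filtering, no loop
mean_age = df["age"].mean()                # aggregation in one call
count_by_city = df.groupby("city").size()  # grouping in one line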

Main methods on df:

  • Exploring the structure

    df.shape          # dimensions (rows, columns)
    df.columns        # list of column names
    df.dtypes         # data types of each column
    df.info()         # concise summary
    df.head(5)        # first 5 rows
    df.tail(5)        # last 5 rows
  • Inspecting the data

    df.describe()              # statistics (mean, std, min, max, quartiles)
    df['col'].value_counts()   # count frequency of values (on a Series)
    df['col'].unique()         # unique values (on a Series)
    df.isnull().sum()          # count missing values per column
    df.sample(5)               # random sample of rows
  • Selecting and filtering

    df['col']             # select one column
    df[['col1','col2']]   # select multiple columns
    df.loc[0]             # select a row by index label
    df.iloc[0]            # select a row by integer position
    df[df['col'] > 10]    # filter rows
  • Sorting and grouping

    df.sort_values('col')                    # sort by one column
    df.sort_values(['col1','col2'])          # sort by several columns
    df.sort_values('col', ascending=False)   # sort in descending order
    df.groupby('col').mean()                 # group rows, average numeric columns
    df.groupby('col').size()                 # number of rows per group
  • Modifying

    df.rename(columns={'old':'new'}, inplace=True)  # rename columns
    df.drop(columns=['col'], inplace=True)          # drop a column
    df.dropna()                                     # drop rows with NaN (returns a copy)
    df.fillna(0)                                    # fill NaN with 0 (returns a copy)
    df.assign(newcol=df['col']*2)                   # add a new column (returns a copy)
  • Exporting

    df.to_csv('file.csv', index=False)      # save as CSV
    df.to_excel('file.xlsx', index=False)   # save as Excel (requires openpyxl)
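
Putting a few of these methods together, here is a minimal sketch of a typical workflow (the file and column names are hypothetical):

import pandas as pd

# Hypothetical input file and columns, for illustration only.
df = pd.read_csv("measurements.csv")

df = df.dropna(subset=["value"])           # discard incomplete records
summary = (
    df[df["value"] > 0]                    # keep positive measurements
    .groupby("condition")["value"]         # group by experimental condition
    .agg(["mean", "std", "count"])         # several statistics at once
)
summary.to_csv("summary.csv")              # export the aggregated table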

3.3. csvkit: the command-line Swiss Army knife#

Installation under your favorite conda environment#

conda activate python-dsspikes-env
conda install -c conda-forge csvkit # That updates the live environment.

Export the environment back to environment.yml: now that your environment includes csvkit, reflect it in your YAML file:

conda env export --from-history > environment.yml

--from-history ensures only the packages you explicitly installed are recorded (a cleaner file that avoids tons of build hashes). Your environment.yml will now list csvkit under dependencies.

Basic command-line tools#

in2csv: Excel Slayer#

in2csv converts various tabular data formats, such as Excel (.xls, .xlsx), DBF, fixed-width files, or even Google Sheets, into clean, standard CSV.

# Convert Excel to CSV and print to stdout
in2csv data.xlsx

# Convert Excel to CSV and save to a file
in2csv data.xlsx > data.csv

# List sheet names in an Excel file
in2csv -n data.xlsx

# Convert a specific sheet
in2csv --sheet "Sheet1" data.xlsx > data.csv

Convert between formats, here CSV to JSON:

in2csv data/records_fake_metadata.csv | csvjson > metadata.json

csvlook: Data Periscope#

Quickly inspect your CSV in the terminal. csvlook displays CSV files as a nicely formatted, readable table, almost like a “pretty-printed” view of your data. Pipe to less -S to scroll horizontally:

csvlook data/records_fake_metadata.csv | less -S

Preview the first few rows:

csvlook data/records_fake_metadata.csv | head -n 12

csvcut: Data Scalpel#

Select columns by name or index:

# By name
csvcut -c cell_id,protocol,prot-opt data/records_fake_metadata.csv

# By index (first column is 1)
csvcut -c 1,4,6 data/records_fake_metadata.csv

csvgrep: Data Filter#

Filter rows based on column values:

# Keep only rows where protocol is IC
csvgrep -c protocol -m IC data/records_fake_metadata.csv

# Keep only rows where prot-opt contains "ramp"
csvgrep -c prot-opt -m ramp data/records_fake_metadata.csv

csvsort: Data Organizer#

Sort your CSV by one or more columns:

# Sort by date
csvsort -c date data/records_fake_metadata.csv

# Sort by protocol, then by date
csvsort -c protocol,date data/records_fake_metadata.csv

csvstat: Quick Stats#

Get basic stats and info about the CSV:

csvstat data/records_fake_metadata.csv

Combining commands#

You can pipe all of these command-line tools together. For example, get cell_id and file_name for all IC step recordings:

csvgrep -c protocol -m IC data/records_fake_metadata.csv \
  | csvgrep -c prot-opt -m steps \
  | csvcut -c cell_id,file_name \
  | csvlook
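
For comparison, here is a sketch of the same pipeline in pandas, assuming the same file and columns (note that csvgrep -m matches substrings; here protocol is matched exactly for simplicity):

import pandas as pd

# Replicate the csvkit pipeline above: filter rows, then select two columns.
df = pd.read_csv("data/records_fake_metadata.csv")
result = df[(df["protocol"] == "IC")
            & df["prot-opt"].astype(str).str.contains("steps")]
print(result[["cell_id", "file_name"]].to_string(index=False))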