Getting started

Path2Insight converts each file path into a WindowsFilePath or PosixFilePath object. Since Python 3.4, the standard library includes pathlib, a module to analyze and manipulate file paths. Path2Insight extends the PurePath class of this module with the classes WindowsFilePath and PosixFilePath. These two classes play an important role in the package and are the input of nearly every function.

As described before, there are objects for Windows file paths and objects for Posix (Linux, macOS) file paths. This is because file paths on Windows and Posix systems are different. A typical (absolute) path in Windows looks like: K:\Datasets\Climate\dataset_climate_change.csv. A typical (absolute) path in Linux looks like: /var/Datasets/Climate/dataset_climate_change.csv. A typical (absolute) path in macOS looks like: /Volumes/Datasets/Climate/dataset_climate_change.csv. Windows file paths start with a drive letter and use backslashes as separators. Posix file paths use a slash as root and slashes as separators.
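The difference between the two flavours can be seen with the standard library alone. The sketch below uses pathlib's PureWindowsPath and PurePosixPath (not the Path2Insight classes) to show how the drive, root and parts differ for the example paths above.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Raw string so the backslashes are not read as escape sequences.
win = PureWindowsPath(r'K:\Datasets\Climate\dataset_climate_change.csv')
posix = PurePosixPath('/var/Datasets/Climate/dataset_climate_change.csv')

# A Windows path has a drive letter and a backslash root.
print(win.drive, win.root)

# A Posix path has no drive, only a slash root.
print(repr(posix.drive), posix.root)

# The first element of .parts is the anchor (drive plus root).
print(win.parts)
print(posix.parts)
```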

FilePath Objects

Load the path2insight.WindowsFilePath and path2insight.PosixFilePath objects.

[2]:

from path2insight import WindowsFilePath, PosixFilePath

Path2Insight makes it possible to analyze a file path in depth. This is done by converting the string-type file path into a WindowsFilePath or PosixFilePath object. One can make a WindowsFilePath object from a string-type file path by passing it as an argument to the class. It is also possible to pass the file path in parts.

[3]:

# a raw string keeps the backslashes from being read as escape sequences
print(WindowsFilePath(r'K:\Datasets\Climate\dataset_climate_change.csv'))
print(WindowsFilePath('K:\\', 'Datasets', 'Climate', 'dataset_climate_change.csv'))

K:\Datasets\Climate\dataset_climate_change.csv
K:\Datasets\Climate\dataset_climate_change.csv

For a Posix type file path, one uses the PosixFilePath class.

[4]:

print(PosixFilePath('/var/Datasets/Climate/dataset_climate_change.csv'))

/var/Datasets/Climate/dataset_climate_change.csv

It is also possible to use relative file paths. This is done exactly the same way as with absolute file paths.

[5]:

print(WindowsFilePath('Climate', 'dataset_climate_change.csv'))
print(WindowsFilePath('..', 'Climate', 'dataset_climate_change.csv'))

Climate\dataset_climate_change.csv
..\Climate\dataset_climate_change.csv
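The same construction rules can be checked with pathlib directly. This is a minimal sketch that uses PureWindowsPath as a stand-in for WindowsFilePath, on the assumption that the Path2Insight class behaves like its pathlib parent here.

```python
from pathlib import PureWindowsPath

# One string, or the same path passed in parts.
whole = PureWindowsPath(r'K:\Datasets\Climate\dataset_climate_change.csv')
in_parts = PureWindowsPath('K:\\', 'Datasets', 'Climate',
                           'dataset_climate_change.csv')

# A relative path is built the same way, just without drive and root.
relative = PureWindowsPath('..', 'Climate', 'dataset_climate_change.csv')

print(whole == in_parts)       # both spellings give the same path
print(whole.is_absolute())     # drive and root are present
print(relative.is_absolute())  # no drive, no root
print(str(relative))
```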

To convert a WindowsFilePath or PosixFilePath object back to a string, pass it to str().

Methods and attributes

Converting the file path from a string into a WindowsFilePath or PosixFilePath object gives you a large number of features. One can split the file path into parts, get the extension, or convert the stem to lower or upper case. See the documentation of WindowsFilePath and PosixFilePath for all the attributes and methods. The following example shows some of the features.

[6]:

path = WindowsFilePath(r'K:\Datasets\Climate\dataset_climate_change.csv')
path.parts
# ('K:\\', 'Datasets', 'Climate', 'dataset_climate_change.csv')
path.drive
# 'K:'
path.lower()
# WindowsFilePath('k:/datasets/climate/dataset_climate_change.csv')
path.stem
# 'dataset_climate_change'
path.extension
# '.csv'
path.upper_stem()
# WindowsFilePath('K:/Datasets/Climate/DATASET_CLIMATE_CHANGE.csv')
path.name
# 'dataset_climate_change.csv'
path.name.upper()
# 'DATASET_CLIMATE_CHANGE.csv'

Note that some of the methods return a WindowsFilePath object while others do not. Which one is preferred depends on your application. Take a look at the following two examples:

[7]:

str(path).lower()

[7]:

'k:\\datasets\\climate\\dataset_climate_change.csv'

[8]:

str(path.lower())

[8]:

'k:\\datasets\\climate\\dataset_climate_change.csv'

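The two results look the same, but they are reached differently. In the first cell, lower() is the string method and is applied to a plain string. In the second cell, path.lower() returns a WindowsFilePath object, which is only converted to a string afterwards (see the cell below).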

[9]:

print(type(str(path).lower()))
print(type(path.lower()))

<class 'str'>
<class 'path2insight.core.WindowsFilePath'>

The same holds when using parts. See below:

[10]:

# the following line returns a tuple with parts
path.lower().parts
# ('k:\\', 'datasets', 'climate', 'dataset_climate_change.csv')

[10]:

('k:\\', 'datasets', 'climate', 'dataset_climate_change.csv')

[11]:

# while the following raises an error
'K:\Datasets\Climate\dataset_climate_change.csv'.lower().parts

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-37be527e591a> in <module>()
      1 # while the following raises an error
----> 2 'K:\Datasets\Climate\dataset_climate_change.csv'.lower().parts

AttributeError: 'str' object has no attribute 'parts'

The list of methods available for the WindowsFilePath and PosixFilePath objects is growing. At the moment, nearly all methods and attributes of the WindowsFilePath class are also available in the PosixFilePath class. The FilePath classes inherit methods and attributes from the PurePath class of the pathlib module (new in Python 3.4). See DOCUMENTATION_LINK for the reference of WindowsFilePath and PosixFilePath.
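To illustrate how methods that return path objects can be layered on top of pathlib, here is a minimal sketch. The class SketchWindowsPath and its method bodies are hypothetical and are not taken from the Path2Insight source; they only mimic the lower() and upper_stem() behaviour shown in the cells above.

```python
from pathlib import PureWindowsPath


class SketchWindowsPath(PureWindowsPath):
    """Hypothetical PureWindowsPath subclass; not Path2Insight's own code."""

    def lower(self):
        # Rebuild a path object from the lowercased string form.
        return type(self)(str(self).lower())

    def upper_stem(self):
        # Uppercase only the stem and keep the final suffix unchanged.
        return self.with_name(self.stem.upper() + self.suffix)


path = SketchWindowsPath(r'K:\Datasets\Climate\dataset_climate_change.csv')
lowered = path.lower()

print(type(lowered).__name__)   # still a path object, not a str
print(lowered.parts)            # inherited pathlib attribute keeps working
print(path.upper_stem().name)
```

Because both methods return an instance of the subclass, inherited attributes such as parts, drive and stem remain available on the result.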

Collections of file paths

The previous section shows how to extract information from a single file path. Path2Insight is optimized to analyze large collections of file paths. One can analyze collections by using list comprehensions, a topic that is clearly covered in the Python documentation. Path2Insight follows the Natural Language Toolkit (NLTK) with this API structure.

For this section we use the following data:

[12]:

from path2insight import WindowsFilePath, PosixFilePath

import path2insight

# raw strings keep the backslashes from being read as escape sequences
paths = [
    r'K:\Datasets\Climate\dataset_climate_change.csv',
    r'K:\Datasets\Climate\dataset_energy_consumption.csv',
    r'K:\Datasets\Climate\dataset_energy_consumption.xlsx',
    r'K:\Datasets\Climate\climate_change.py',
    r'K:\Datasets\Climate\README'
]

The first step is to convert the list of file paths into WindowsFilePath objects. This can be done with a list comprehension (see the cell below) or with the parse function: path2insight.parse(paths, 'windows').

[13]:

windows_paths = [WindowsFilePath(path) for path in paths]
windows_paths

[13]:

[WindowsFilePath('K:/Datasets/Climate/dataset_climate_change.csv'),
 WindowsFilePath('K:/Datasets/Climate/dataset_energy_consumption.csv'),
 WindowsFilePath('K:/Datasets/Climate/dataset_energy_consumption.xlsx'),
 WindowsFilePath('K:/Datasets/Climate/climate_change.py'),
 WindowsFilePath('K:/Datasets/Climate/README')]
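The conversion step can be wrapped in a small helper. The sketch below is an assumption about what such a helper might look like: parse_paths is a hypothetical name, it returns pathlib objects rather than Path2Insight objects, and it is not the implementation of path2insight.parse.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Map a flavour name to the pathlib class that parses it.
_FLAVOURS = {'windows': PureWindowsPath, 'posix': PurePosixPath}


def parse_paths(raw_paths, os_name='windows'):
    """Convert an iterable of path strings into a list of path objects."""
    try:
        path_class = _FLAVOURS[os_name]
    except KeyError:
        raise ValueError('unknown flavour: {!r}'.format(os_name))
    return [path_class(raw) for raw in raw_paths]


raw = [
    r'K:\Datasets\Climate\dataset_climate_change.csv',
    r'K:\Datasets\Climate\README',
]
parsed = parse_paths(raw, 'windows')
print([p.name for p in parsed])
```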

With list comprehensions, one can extract information from the file paths. For example, extract the extension of each path:

[14]:

[path.extension for path in windows_paths]

[14]:

['.csv', '.csv', '.xlsx', '.py', '']

Or a more advanced example where the tokens of the file path stems are collected:

[15]:

[token for path in windows_paths for token in path.tokenize_stem()]

[15]:

['dataset',
 'climate',
 'change',
 'dataset',
 'energy',
 'consumption',
 'dataset',
 'energy',
 'consumption',
 'climate',
 'change',
 'README']

One can use the power of Python’s Counter class to count the tokens.

[16]:

from collections import Counter

Counter([token for path in windows_paths for token in path.tokenize_stem()])

[16]:

Counter({'README': 1,
         'change': 2,
         'climate': 2,
         'consumption': 2,
         'dataset': 3,
         'energy': 2})
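The token count above can be approximated with the standard library. In this sketch, tokenize_stem_sketch is a hypothetical stand-in that splits the stem on anything that is not a letter or digit; the tokenizer in Path2Insight may use different rules.

```python
import re
from collections import Counter
from pathlib import PureWindowsPath


def tokenize_stem_sketch(path):
    """Split the stem into runs of letters and digits (assumed rule)."""
    return re.findall(r'[A-Za-z0-9]+', path.stem)


example_paths = [PureWindowsPath(p) for p in [
    r'K:\Datasets\Climate\dataset_climate_change.csv',
    r'K:\Datasets\Climate\dataset_energy_consumption.csv',
    r'K:\Datasets\Climate\dataset_energy_consumption.xlsx',
    r'K:\Datasets\Climate\climate_change.py',
    r'K:\Datasets\Climate\README',
]]

# Flatten the tokens of all stems and count them.
token_counts = Counter(
    token for path in example_paths for token in tokenize_stem_sketch(path)
)
print(token_counts.most_common(3))
```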

Selecting, sampling and sorting file paths

Path2Insight contains functions to handle collections of file paths (stored in lists). These functions make it easy to subset and select parts of the data. This section shows some examples of using the file path handling functions. The functions described in this section cover sampling, subsetting and sorting.

First, load the Ensembl demo dataset.

[17]:

from path2insight.datasets import load_ensembl

data = load_ensembl()
data[:5]

[17]:

[PosixFilePath('/Volumes/release-90/README'),
 PosixFilePath('/Volumes/release-90/xml/ensembl-compara/homologies/README.gene_trees.xml_dumps.txt'),
 PosixFilePath('/Volumes/release-90/xml/ensembl-compara/homologies/MD5SUM'),
 PosixFilePath('/Volumes/release-90/xml/ensembl-compara/homologies/Compara.90.protein_murinae.tree.phyloxml.xml.tar.gz'),
 PosixFilePath('/Volumes/release-90/xml/ensembl-compara/homologies/Compara.90.protein_murinae.tree.orthoxml.xml.tar.gz')]

[18]:

# sample 5 paths from the data
path2insight.sample(data, 5)

[18]:

[PosixFilePath('/Volumes/release-90/mysql/ensembl_mart_90/lafricana_gene_ensembl__homolog_jjaculus__dm.txt.gz'),
 PosixFilePath('/Volumes/release-90/mysql/ensembl_mart_90/gmorhua_gene_ensembl__homolog_tguttata__dm.txt.gz'),
 PosixFilePath('/Volumes/release-90/mysql/ensembl_mart_90/mspreteij_gene_ensembl__homolog_oprinceps__dm.txt.gz'),
 PosixFilePath('/Volumes/release-90/mysql/ensembl_mart_90/ttruncatus_gene_ensembl__protein_feature_prints__dm.txt.gz'),
 PosixFilePath('/Volumes/release-90/gff3/mus_caroli/Mus_caroli.CAROLI_EIJ_v1.1.90.chr.gff3.gz')]
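Sampling without replacement from a list of paths can be sketched with the random module. The helper sample_paths and its seed parameter are hypothetical; whether path2insight.sample accepts a seed is not stated here.

```python
import random
from pathlib import PurePosixPath


def sample_paths(paths, k, seed=None):
    """Draw k distinct paths; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(paths), k)


demo = [PurePosixPath('/Volumes/release-90/file_{}.txt'.format(i))
        for i in range(10)]

picked = sample_paths(demo, 3, seed=42)
print(picked)
```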

[19]:

# make a subset of paths with 'fasta' as the third level and 'dna' as the fifth level.
data_subset = path2insight.select(data, level3='fasta', level5='dna')
data_subset[:5]

[19]:

[PosixFilePath('/Volumes/release-90/fasta/xiphophorus_maculatus/dna/Xiphophorus_maculatus.Xipmac4.4.2.dna_sm.toplevel.fa.gz'),
 PosixFilePath('/Volumes/release-90/fasta/xiphophorus_maculatus/dna/Xiphophorus_maculatus.Xipmac4.4.2.dna_sm.nonchromosomal.fa.gz'),
 PosixFilePath('/Volumes/release-90/fasta/xiphophorus_maculatus/dna/Xiphophorus_maculatus.Xipmac4.4.2.dna_rm.toplevel.fa.gz'),
 PosixFilePath('/Volumes/release-90/fasta/xiphophorus_maculatus/dna/Xiphophorus_maculatus.Xipmac4.4.2.dna_rm.nonchromosomal.fa.gz'),
 PosixFilePath('/Volumes/release-90/fasta/xiphophorus_maculatus/dna/Xiphophorus_maculatus.Xipmac4.4.2.dna.toplevel.fa.gz')]
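The level numbering in the example above appears to match the index into the parts tuple, with the root '/' at index 0. The sketch below relies on that inferred assumption; select_by_level is a hypothetical helper built on pathlib, not the implementation of path2insight.select.

```python
from pathlib import PurePosixPath


def select_by_level(paths, **levels):
    """Keep paths whose parts match every levelN='value' keyword."""
    # Turn keyword names such as level3 into integer indexes into .parts.
    wanted = {int(name[len('level'):]): value
              for name, value in levels.items()}
    return [
        path for path in paths
        if all(index < len(path.parts) and path.parts[index] == value
               for index, value in wanted.items())
    ]


demo = [PurePosixPath(p) for p in [
    '/Volumes/release-90/fasta/xiphophorus_maculatus/dna/a.fa.gz',
    '/Volumes/release-90/fasta/xiphophorus_maculatus/cdna/b.fa.gz',
    '/Volumes/release-90/gff3/mus_caroli/c.gff3.gz',
    '/Volumes/release-90/README',
]]

subset = select_by_level(demo, level3='fasta', level5='dna')
print(subset)
```

Paths that are too short to have the requested level are dropped rather than raising an IndexError.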

[20]:

# The default list sort method also works for WindowsFilePath
# and PosixFilePath objects. This method sorts the data in place.
data.sort()

[21]:

# same as data.sort() but not in place
data_sorted = path2insight.sort(data)
data_sorted[:5]

[21]:

[PosixFilePath('/Volumes/release-90/README'),
 PosixFilePath('/Volumes/release-90/fasta/ciona_savignyi/dna_index/CHECKSUMS'),
 PosixFilePath('/Volumes/release-90/fasta/ciona_savignyi/dna_index/Ciona_savignyi.CSAV2.0.dna.toplevel.fa.gz'),
 PosixFilePath('/Volumes/release-90/fasta/ciona_savignyi/dna_index/Ciona_savignyi.CSAV2.0.dna.toplevel.fa.gz.fai'),
 PosixFilePath('/Volumes/release-90/fasta/ciona_savignyi/dna_index/Ciona_savignyi.CSAV2.0.dna.toplevel.fa.gz.gzi')]

In the following example, the data is sorted on level 5 first and level 4 second.

[22]:

data_sorted_advanced = path2insight.sort(data, level=[5, 4])
data_sorted_advanced[:5]

[22]:

[PosixFilePath('/Volumes/release-90/README'),
 PosixFilePath('/Volumes/release-90/gff3/ailuropoda_melanoleuca/Ailuropoda_melanoleuca.ailMel1.90.abinitio.gff3.gz'),
 PosixFilePath('/Volumes/release-90/gtf/ailuropoda_melanoleuca/Ailuropoda_melanoleuca.ailMel1.90.abinitio.gtf.gz'),
 PosixFilePath('/Volumes/release-90/tsv/ailuropoda_melanoleuca/Ailuropoda_melanoleuca.ailMel1.90.ena.tsv.gz'),
 PosixFilePath('/Volumes/release-90/tsv/ailuropoda_melanoleuca/Ailuropoda_melanoleuca.ailMel1.90.entrez.tsv.gz')]
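Sorting on several levels amounts to sorting with a tuple key. This is a minimal sketch under two assumptions inferred from the output above: a level is an index into the parts tuple, and a path that lacks a level sorts before paths that have it. The helper sort_by_levels is hypothetical and is not the implementation of path2insight.sort.

```python
from pathlib import PurePosixPath


def sort_by_levels(paths, levels):
    """Return a new list sorted on the given part indexes, in order."""
    def key(path):
        parts = path.parts
        # A missing level becomes '', which sorts before any real name.
        return tuple(parts[i] if i < len(parts) else '' for i in levels)
    return sorted(paths, key=key)


demo = [PurePosixPath(p) for p in [
    '/Volumes/release-90/tsv/b_species/x.tsv.gz',
    '/Volumes/release-90/gff3/a_species/y.gff3.gz',
    '/Volumes/release-90/README',
]]

# Level 5 (here the file name) first, level 4 (the species folder) second.
ordered = sort_by_levels(demo, [5, 4])
print([p.name for p in ordered])
```

Like the built-in sorted(), the helper leaves the input list unchanged.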