Python - JI

Python

Revised July 12, 2022

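Installation

To learn Python starting from installation, see the following pages.
DNA research (in English), marine research (in Japanese).

Building from source

On CentOS 7
To install Python by building it from the source files, this page is a useful reference. For example, to install Python-3.7.3, follow the steps below.

Preparation

Some Python modules reportedly depend on external packages. These external packages must be installed before Python is built. Install the following.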

sudo yum install zlib-devel libffi-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libuuid-devel xz-devel

Installing Python

curl -O https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tgz
tar xf Python-3.7.3.tgz
cd Python-3.7.3
./configure
make
sudo make altinstall
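
After the build, it can help to confirm that the optional standard-library modules that rely on the packages above were actually compiled. The following is a minimal sketch and not part of the original procedure; the module list and the function name check_optional_modules are assumptions chosen for illustration. Run it with the newly installed interpreter (for example python3.7).

```python
import importlib

# Standard-library modules whose C extensions are only built when the
# matching development package was present at compile time
# (an illustrative list, not an exhaustive one).
OPTIONAL_MODULES = ["zlib", "bz2", "lzma", "ssl", "sqlite3",
                    "readline", "curses", "ctypes", "tkinter", "dbm.gnu"]

def check_optional_modules(names=OPTIONAL_MODULES):
    """Return a dict mapping each module name to True if it imports cleanly."""
    result = {}
    for name in names:
        try:
            importlib.import_module(name)
            result[name] = True
        except ImportError:
            result[name] = False
    return result

if __name__ == "__main__":
    for name, ok in check_optional_modules().items():
        print(name, "OK" if ok else "MISSING")
```

A module reported as MISSING usually means the corresponding -devel package was not installed when ./configure and make were run.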

On Ubuntu

I followed the page "Ubuntu 環境の Python" (Python on Ubuntu).

Running a file directly with a shebang line (#!)

#!/usr/bin/env python

print ("Hello world.")

Save it as test.py.

chmod u+x test.py

This gives the script execute permission. env works much like which and locates python automatically. If the current directory is not on PATH, run the script as ./test.py.

[inouejun:scripts]$ test.py
Hello world.

List operations

>>> A=[1,2,3]
>>> B=A
>>> A[2] = 4
>>> A
[1, 2, 4]
>>> B
[1, 2, 4]

>>> B=A[:]
>>> A[2] = 5
>>> A
[1, 2, 5]
>>> B
[1, 2, 4]

>>> MyString = "abcdefg"
>>> MyList = list(MyString)
>>> MyList
['a', 'b', 'c', 'd', 'e', 'f', 'g']
>>> "".join(MyList)
'abcdefg'

>>> MyList = range(11,20)
>>> 11 in MyList
True

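Processing huge files

with open("infile") as fileobject:
    # Iterating over the file object reads one line at a time,
    # so the whole file is never loaded into memory.
    for line in fileobject:
        print(line, end="")    # line already ends with "\n"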

Reading FASTA format

#!/usr/bin/env python

from collections import OrderedDict

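def readFasta_dict(InfileNameFN):
    # Map each header line (including ">") to its concatenated sequence.
    seqDictFN = OrderedDict()
    Name = None
    with open(InfileNameFN, "r") as Infile:
        for Line in Infile:
            Line = Line.rstrip("\n")
            if not Line:
                continue                      # skip blank lines
            if Line.startswith(">"):
                Name            = Line
                seqDictFN[Name] = ""
            elif Name is not None:
                seqDictFN[Name] += Line
    return seqDictFN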

InfileName = "my.fasta"
seqDict = readFasta_dict(InfileName)

for name, seq in seqDict.items():
    print(name)
    print(seq)

For a method that uses re.compile, see here.
readFasta.tar.gz

Splitting FASTA format by gene

fasta2separate.tar.gz (February 2022)

Splitting Phylip format into codon blocks

phyCodon2Block.tar.gz (February 2018)

Converting FASTA format to Phylip interleaved format - 1

fasToPhyInterleaved.tar.gz (November 2017)

Converting FASTA format to Phylip interleaved format - 2

fas2alnTxt.tar.gz (May 2018)

Restoring the "match first" notation in Phylip format to full sequences

phyRead.tar.gz (February 2018)

Translating into amino acid sequences

def translation(dna):
    dna = dna.upper()
    protein = ""
    for codon in splitDna(dna):
        aaChr = geneticCodeMTvert.get(codon, "X")    ## <= GeneticCode
        protein = protein + aaChr
    return protein

translation.tar.gz (February 2018)
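
The function above relies on splitDna and geneticCodeMTvert, which are defined in the archive and not shown here. The following is a self-contained sketch of how they might look; both definitions are assumptions for illustration, and the codon table is deliberately partial (a real table must cover all 64 codons of the vertebrate mitochondrial code).

```python
# Partial vertebrate mitochondrial codon table, for illustration only.
geneticCodeMTvert = {
    "ATG": "M", "ATA": "M",   # ATA encodes Met in this code
    "TGA": "W", "TGG": "W",   # TGA encodes Trp instead of a stop
    "TTT": "F", "TTC": "F",
    "AGA": "*", "AGG": "*",   # AGA and AGG are stop codons
    "TAA": "*", "TAG": "*",
}

def splitDna(dna):
    """Yield successive complete codons; a trailing partial codon is dropped."""
    for i in range(0, len(dna) - len(dna) % 3, 3):
        yield dna[i:i + 3]

def translation(dna):
    """Translate a DNA string; codons missing from the table become X."""
    dna = dna.upper()
    protein = ""
    for codon in splitDna(dna):
        protein += geneticCodeMTvert.get(codon, "X")
    return protein

print(translation("atgatatgatttaga"))  # MMWF*
```

Codons containing ambiguous bases such as N are not in the table, so they are reported as X.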

Converting FASTA format to HTML format

$ python3 fasToHTML.py

fasToHTML.tar.gz (November 2017)

Pattern matching

line = "gi|1169025|COX1_CAEEL"
if "COX1" in line:
    print("Found")

line = ">gi|1169025|COX1_CAEEL"
if line.startswith('>'):
    print("Found")

import re
line = "ATATGTGTGAAA"
if re.search(r"AT[ATGC]TG", line):
    print("Found")

import re
line = "ATGTGTGTGAAAAA"
if re.search(r"^ATG.*A{3,}$", line):
    print("Found")

import re

protID = ">FBpp0082536"
key = r">FBpp\d+"    # raw string so that \d reaches the regex engine unchanged
if re.search(key, protID):
    print("  re.search:Found")
if protID.startswith(key):
    print("  startswith:Found")
else:
    print("  startswith:NotFound")
                    
[junINOUEpro:Downloads]$ python3 test.py 
  re.search:Found
  startswith:NotFound
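
The result above follows from the fact that str.startswith compares against literal text, whereas the re functions interpret the pattern. As a small sketch (the variable names are illustrative), re.match can be used when a regular expression must match at the beginning of a string, and startswith remains sufficient for a fixed prefix:

```python
import re

protID = ">FBpp0082536"
pattern = r">FBpp\d+"

# str.startswith treats the argument as literal text, so "\d+" is not expanded.
literal_hit = protID.startswith(pattern)

# re.match applies the pattern at the beginning of the string.
regex_hit = re.match(pattern, protID) is not None

# For a fixed prefix, startswith with the literal part is enough.
prefix_hit = protID.startswith(">FBpp")

print(literal_hit, regex_hit, prefix_hit)  # False True True
```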

import re
line = "ATATGYTGTGAAA"
match = re.search(r"[^ATGC]", line)
if match:
    print("Found other than ATGC")
    otherChar = match.group()
    print(": " + otherChar)

[junINOUEpro:Downloads]$ python3 test.py 
Found other than ATGC
: Y

import re
line = "ATNATGYTGTRGAAA"
hits = re.finditer(r"[^ATGC]", line)
for hit in hits:
    base = hit.group()    # avoid shadowing the built-in chr()
    startPos = hit.start()
    print(base + ":" + str(startPos))

[junINOUEpro:Downloads]$ python3 test2.py 
N:2
Y:6
R:10

Dictionary operations

rec = {
    "Human"  : "ATGCTTG",
    "Monkey" : "TTGCTTG",
    "Cat"    : "GTGCTTG",
    "Fish"   : "CTGCTTG",
}
longestName = max(rec.keys(), key = len)
print(longestName) 
[junINOUEb:Downloads]$ python3 test.py 
Monkey


from collections import OrderedDict
rec = OrderedDict()
rec["1st"] = "ATGCTTG"
rec["2nd"] = "TTGCTTG"
rec["3rd"] = "GTGCTTG"
rec["4th"] = "CTGCTTG"
first = list(rec.keys())[0]
print(first)
[junINOUEb:Downloads]$ python3 test.py 
1st

import re

protID = "ENSACAP00000000002"
recs = {
">ENSACAP00000000002 pep1" : "ATGCTG", 
">ENSACAP00000000003 pep"  : "AGGCTG", 
">ENSACAP00000000002 pep2" : "ACGCTG", 
">ENSACAP00000000005 pep"  : "AAGCTG", 
">ENSACAP00000000006 pep"  : "ATACTG"
}
key = ">" + protID + r" \w*"
hitNames = [name for name in recs.keys() if re.search(key,name)]
if len(hitNames) > 1:
    print("More than 1 hit.")
    print(hitNames)
elif len(hitNames) == 1:
    print("1 hit.")
    print(hitNames)
else:
    print("No hit.")

[junINOUEpro:Downloads]$ python3 test.py 
More than 1 hit.
['>ENSACAP00000000002 pep1', '>ENSACAP00000000002 pep2']

Replacing characters

import re
line = "CGNTAGCTACTACGTGCATN"
ans = re.sub(r"[^ATGCatgc]", "-", line)
print(ans)

[junINOUEpro:Downloads]$ python3 test.py 
CG-TAGCTACTACGTGCAT-

import re
string = ">FBpp0082536 gene:xxx"
rr = re.sub(r">([^ ]+) (gene:.*$)", r">\1 A:\1 ", string)
print("rr: ", rr)

[junINOUEpro:Downloads]$ python test.py
rr:  >FBpp0082536 A:FBpp0082536

line = "Anguilla japonica"
ans = line.replace(" ", "-")
print(ans)

[junINOUEpro:Downloads]$ python3 test2.py 
Anguilla-japonica

Extracting matched parts

import re
name = "Anguilla_australis_schmidti"
match = re.search("^([^_]+)_[^_]+_([^_]+)$", name)
if match:
    genusName = match.group(1)
    speciesName = match.group(2)
    print("genus: " + genusName)
    print("species: " + speciesName)

[junINOUEpro:Downloads]$ python3 test2.py 
genus: Anguilla
species: schmidti

import re
line = "ATGCCGCCCGCCATTGCTGGTGGGGGGGGGGATAT"
hit = re.findall(r"[GC]{6,}", line)
print(hit)

[junINOUEpro:Downloads]$ python3 test2.py 
['GCCGCCCGCC', 'GGGGGGGGGG']

Handling arguments

import sys

print('sys.argv : ', sys.argv)
print('type(sys.argv) : ', type(sys.argv))
print('len(sys.argv) : ', len(sys.argv))

print()

print('sys.argv[0] : ', sys.argv[0])
print('sys.argv[1] : ', sys.argv[1])
print('sys.argv[2] : ', sys.argv[2])
print('type(sys.argv[0]): ', type(sys.argv[0]))
print('type(sys.argv[1]): ', type(sys.argv[1]))
print('type(sys.argv[2]): ', type(sys.argv[2]))

source: sys_argv.py

Detection of variable sites

[inouejun:variableSites]$ python3 variableSites.py > out.txt

variableSites.tar.gz (November 2019)

Extracting gene sequences from FASTA

retrieve_gene.tar.gz

(July 2020)

Listing the contents of a directory

import os

for fileName in os.listdir("."):
    print(fileName)

Running other programs

import subprocess

subprocess.call("date", shell=True)

Getting the output of another program as a list

import subprocess

blastLineFN = "makeblastdb -in db_nucl.txt -dbtype nucl -parse_seqids"
# communicate() returns (stdout, stderr); [0] is stdout as bytes
res_blastdbcmd_TMP = subprocess.Popen(blastLineFN, stdout=subprocess.PIPE, shell=True).communicate()[0]

res_blastdbcmd = res_blastdbcmd_TMP.decode('utf-8').split("\n")
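
On Python 3.7 or later, the same result can also be obtained with subprocess.run, which decodes the output when text=True is given. The following is a hedged sketch and not the method used in the archive; run_and_split is a hypothetical helper name, and the command is a portable stand-in that asks the current interpreter to print two lines in place of a BLAST call.

```python
import subprocess
import sys

def run_and_split(command_args):
    """Run a command given as an argument list and return stdout as a list of lines."""
    completed = subprocess.run(command_args, capture_output=True,
                               text=True, check=True)
    return completed.stdout.splitlines()

# Stand-in command: the current Python interpreter prints two lines.
lines = run_and_split([sys.executable, "-c", "print('seq1'); print('seq2')"])
print(lines)  # ['seq1', 'seq2']
```

Passing the command as a list avoids shell=True, and check=True raises subprocess.CalledProcessError if the program exits with a non-zero status.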

subprocess.Popen.tar.gz

For conversion between the bytes and str types, see here.

(July 2019)

Command-line arguments

import sys

argument1 = sys.argv[1]         
argument2 = int(sys.argv[2])
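
Because sys.argv[1] raises IndexError when an argument is missing and int() raises ValueError for non-numeric input, a script may want to check both before use. A minimal sketch follows; the function name parse_arguments and the usage message are assumptions for illustration.

```python
import sys

def parse_arguments(argv):
    """Return (string, integer) from an argv-style list, or exit with a message."""
    if len(argv) != 3:
        sys.exit("Usage: python3 " + argv[0] + " <string> <integer>")
    try:
        number = int(argv[2])
    except ValueError:
        sys.exit("The second argument must be an integer: " + argv[2])
    return argv[1], number

# Example with an explicit list; in a script, pass sys.argv instead.
print(parse_arguments(["test.py", "my.fasta", "10"]))  # ('my.fasta', 10)
```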

Comparing contents between directories

[junINOUEpro:compare2dirs]$ python3 missFile.py
set()
{'ENSG00000000005'}
compare2dirs.tar.gz (October 2017)
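
missFile.py itself is only available in the archive. Judging from the two sets in the output above, one plausible approach is a set difference of the file names in the two directories. The sketch below is written under that assumption; missing_files is a hypothetical name, and the temporary directories and file names exist only for the demonstration.

```python
import os
import tempfile

def missing_files(dir_a, dir_b):
    """Return (names only in dir_a, names only in dir_b) as two sets."""
    names_a = set(os.listdir(dir_a))
    names_b = set(os.listdir(dir_b))
    return names_a - names_b, names_b - names_a

# Demonstration on two temporary directories.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    for name in ["ENSG00000000003"]:
        open(os.path.join(dir_a, name), "w").close()
    for name in ["ENSG00000000003", "ENSG00000000005"]:
        open(os.path.join(dir_b, name), "w").close()
    only_a, only_b = missing_files(dir_a, dir_b)
    print(only_a)  # set()
    print(only_b)  # {'ENSG00000000005'}
```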

Zipping a directory that contains directories

#!/usr/local/bin/python3

import os, zipfile

myzip = zipfile.ZipFile('../html/test.zip','w')
files = os.listdir(path="../html/test/")
for file in files:
    print("file", file)
    myzip.write("../html/test/" + file, file)
    if os.path.isdir("../html/test/" + file):
        files_eachdir = os.listdir(path="../html/test/" + file)
        for file_eachdir in files_eachdir:
            print("file_eachdir", file_eachdir)
            myzip.write("../html/test/" + file + "/" \
            + file_eachdir, file + "/" + file_eachdir)

test_zip.tar.gz

(July 2022)
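
The script above handles one level of subdirectories. If the nesting can be deeper, os.walk visits every level. The following is a minimal sketch under that assumption, not the code in the archive; zip_directory is a hypothetical name. Empty directories are not stored.

```python
import os
import zipfile

def zip_directory(source_dir, zip_path):
    """Write every file under source_dir into zip_path, keeping relative paths."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as myzip:
        for dirpath, dirnames, filenames in os.walk(source_dir):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                # Store the path relative to source_dir inside the archive.
                myzip.write(full_path, os.path.relpath(full_path, source_dir))

# Example call, using the paths from the script above:
# zip_directory("../html/test", "../html/test.zip")
```

The zip file should be created outside source_dir, as in the example call, so that the archive does not try to include itself.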

import zipfile
import glob
import os

# Use glob to list the .txt files inside the directory
files = glob.glob("test_dir/dir_A/*.txt")

# Create the compressed file dir_Az.zip inside test_dir
f     = zipfile.ZipFile("test_dir/dir_Az.zip","w",
                         zipfile.ZIP_DEFLATED )

# Use os.path.basename to strip the path part from the stored file name
for file in files:
    f.write (file, os.path.basename(file))
f.close()

python_glob.tar.gz (February 2018)

I referred to the following pages.
- Programering: Python Notes file compression
- Qiita: フォルダ内のリストを取得する (Getting a list of the files in a folder)

Newick: storing clades in a list

import re

tree = '(4,(3,(2,1)));'
#tree = '(4,((3,5),(2,1)));'
#tree = '(((4,6),7),(3,5),(2,1));'

cladeReg = r"\(([^\(\)]+)\)"    # innermost pair of parentheses

clades = []
while re.search(cladeReg, tree):
    match = re.search(cladeReg, tree)
    clades.append(match.group())
    # Remove the parentheses of the clade just recorded
    tree = re.sub(cladeReg, r"\1", tree, count=1)
    
for clade in clades:
    print(clade)

[junINOUEpro:tree]$ python3 cladeCollect.py
(2,1)
(3,2,1)
(4,3,2,1)

nodesCollectFromNewickPYTHON.tar.gz (January 2018)

Reordering sequences to follow the order of a phylogenetic tree

sort_seq_by_tree.tar.gz
(July 2020)

Debugger

Specifying it on the command line

python -m pdb test.py

Then set a breakpoint with b and run to that location with c. A breakpoint cannot be set on a commented-out line. See here for details.

Writing it into the program

import pdb

line = "gi|1169025|COX1_CAEEL"
pdb.set_trace()
if "COX1" in line:
    print("Found")

[junINOUEpro:Downloads]$ python3 test.py
> /Users/junINOUEpro/Downloads/test.py(5)<module>()
-> if "COX1" in line:

(Pdb) l
1 import pdb
2
3 line = "gi|1169025|COX1_CAEEL"
4 pdb.set_trace()
5 -> if "COX1" in line:
6 print("Found")
[EOF]

(Pdb) p line
'gi|1169025|COX1_CAEEL'

(Pdb) n
> /Users/junINOUEpro/Downloads/test.py(6)<module>()
-> print("Found")

(Pdb) n
Found
--Return--
> /Users/junINOUEpro/Downloads/test.py(6)<module>()->None
-> print("Found")

Working with R

python3 control.py

PypeR_NJ.tar.gz
PypeR is used to make R load NJBS.R, which estimates an NJ tree with bootstrap (BS) values. PypeR must be installed with the package manager pip.

pip install pyper

I referred to this page (December 2017).

Checking whether numpy is installed

[cluster:~]$ cat test_pythonLibraries.py
import numpy as np
import scipy as sp
print("np.__version__")
print(np.__version__)
print("sp.__version__")
print(sp.__version__)

[cluster:~]$ python3 test_pythonLibraries.py
np.__version__
1.14.3
sp.__version__
1.0.0

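Links

Python 入門 (Introduction to Python)

Covers how to open FASTA files and more (pp. 16, 77). Explains Python syntax by comparing Python code with Perl. The difference between lists and tuples is explained clearly.

Practical computing for biologists

A book that introduces practical bioinformatics with Python. Basics such as how to use TextWrangler are explained carefully. The book's website appears to have a question forum, but there seem to be no recent posts (as of November 2017).

Python for Biologists

・Website of an introductory Python book (Amazon) that uses DNA sequence manipulation as its examples.
・Each chapter ends with exercises and detailed solutions.
・The site can be searched from PYTHON ARTICLES at the lower right (e.g. sys.argv).
・PROGRAMMING ARTICLES contains useful code.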

Advanced Python for Biologists