Making Good Use of Python Third-Party Libraries: Precisely Counting Typhoon Landfall Hotspots and Intensities on Hainan Island

Owing to its unique geographic position, Hainan Island is affected every year by numerous typhoons from the South China Sea or the tropical western Pacific, earning it the nickname "South China Sea typhoon corridor". Building a storm-surge model for Hainan Island matters for mitigating the surge disasters that typhoons bring. Such a model is driven by a typhoon wind field, so designing sensitivity experiments for different landfall points, approach directions, and intensities is important, which means we need to count and classify typhoons that made landfall on Hainan Island from historical records. Assuming we have already compiled 6-hourly typhoon data for past years, including track longitude/latitude, intensity, and wind speed, how do we compute the distribution of typhoon landfall points on Hainan Island?

The first idea, and the one most used by predecessors, is to build a coarse grid over the study region and its surroundings (Fig. 1a), interpolate the typhoon tracks onto the grid points, count hits at each grid point to obtain a heat map of all points the typhoons passed, and then read landfall hotspots along the Hainan coast off that map. This is simple, but its drawbacks are obvious: it is too coarse, and both secondary landfalls and points where a typhoon exits the island introduce large errors. The second method divides the Hainan coast into several regions with rectangular boxes (Fig. 1b), densifies each track by interpolation, then walks each typhoon track in time order and tests whether each track point falls inside one of the rectangles; if it does, record it and move on to the next typhoon's track. This avoids counting one typhoon multiple times and partitions the coast, so it can produce a landfall-hotspot distribution. Still, it is not what we want: not precise enough, and not elegant.
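The coarse-grid count in this first method is, in essence, a two-dimensional histogram of the track points. A minimal sketch with made-up track coordinates and an arbitrary 0.5° grid (not real best-track data):

```python
import numpy as np

# Toy track points (lon, lat); real input would be 6-hourly best-track fixes
lon = np.array([110.1, 110.3, 110.4, 109.8])
lat = np.array([18.2, 18.6, 19.1, 19.5])

# Count track points per 0.5-degree cell over an invented study region
counts, xedges, yedges = np.histogram2d(
    lon, lat, bins=[np.arange(108, 112.5, 0.5), np.arange(17, 21.5, 0.5)])

print(counts.sum())  # every toy point falls in some cell
```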

The second method already contains the idea of "testing whether a point lies in a region", but only for rectangles, where checking the longitude/latitude ranges suffices. We can extend that idea: how do we test whether a point lies inside an arbitrary polygon? There are several algorithms (area comparison, winding angle, ray casting), but writing one ourselves is overkill; such a common operation surely has a Python package. A quick search turns up Shapely (https://github.com/Toblerity/Shapely), a Python library for manipulation and analysis of planar geometric objects, built on the widely used GEOS (the engine of PostGIS) and JTS libraries, with powerful operations on points, lines, polygons, and so on. With Shapely in hand, we still need the outline of Hainan Island. GADM (http://www.gadm.org/) provides a global administrative-areas database covering national borders and sub-national divisions for almost every country and region; its shapefiles can be used directly for plotting with Python's Basemap package (https://matplotlib.org/basemap/) and also contain the longitude/latitude coordinates of every region's outline. GADM's data for China comes at levels 0, 1, 2, and 3, corresponding to country, province, county (city), and township. With these data and packages we can count typhoon landfall points.
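To make the point-in-polygon idea concrete before bringing in real coastlines, here is a minimal Shapely sketch on toy coordinates:

```python
from shapely.geometry import Point, Polygon

# A toy "island": a 2x2 square
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])

print(square.contains(Point(1, 1)))   # True: strictly inside
print(square.contains(Point(3, 1)))   # False: outside
```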

First, extract the outline of Hainan Island from the GADM China level-1 data. It contains 35749 coordinate points, far too many, so we build a Polygon from the outline with Shapely's Polygon, discard divisions whose area is too small (the small islands around Hainan), and then call .simplify(0.05) to reduce the polygon to 49 points (Fig. 1c). For each typhoon, build a LineString from its track with Shapely's LineString and intersect it with the island Polygon via Polygon.intersection to obtain the crossing points. Experiments show that Shapely's intersection between a Polygon and a LineString respects the coordinate order of the LineString, so as long as the LineString's coordinates follow the track's time order, we simply take the first point of the intersection as the landfall point. Once all landfall points are found, we tally them against the township-level divisions from the GADM level-3 data (Fig. 1d): build a Polygon from each division's outline and test Polygon.contains(Point). However, because the island outline above was simplified, a landfall point may fall slightly outside every division, so we flip the approach: inflate the landfall point into a small disc with Point.buffer(0.04), test Polygon.intersects(..) between each division and the disc, record a hit when they intersect, and remove that typhoon from the remaining division loop (to avoid double counting when the buffered point intersects multiple divisions). Finally, plot the landfall-point distribution with Basemap and present the statistics with the windrose package (https://github.com/python-windrose/windrose); Fig. 2 shows a wind rose of typhoon intensities.
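The whole pipeline can be sketched on toy geometry (a square stands in for the simplified Hainan outline, and the township polygon is invented for illustration; the first-coordinate trick relies on the intersection preserving the track's direction, as described above):

```python
from shapely.geometry import Polygon, LineString, Point

# Toy island outline (plays the role of the simplified Hainan polygon)
island = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])

# Typhoon track in time order, approaching from the east
track = LineString([(6, 2), (5, 2), (1, 2)])

# The part of the track inside the island keeps the track's coordinate order,
# so the first coordinate of the intersection is the landfall point
inside = island.intersection(track)
landfall = Point(inside.coords[0])
print(landfall.x, landfall.y)

# A toy township polygon that the landfall point narrowly misses
# (as it can with a simplified coastline)
district = Polygon([(4.01, 1.5), (5, 1.5), (5, 2.5), (4.01, 2.5)])
print(district.contains(landfall))                  # just outside
print(district.intersects(landfall.buffer(0.04)))   # buffered disc overlaps
```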

Summary: Python is an elegant, flexible scripting language whose many mature third-party libraries can greatly boost productivity. When tackling a practical problem, look for a third-party library that already ships a mature algorithm or solution, read its API documentation, and work out a feasible route from there. Likewise, the internet offers many public databases and datasets; using them well makes our work both more precise and more efficient.

Downloading Typhoon Data with a Python Script

Downloaded from the typhoon-track site of the Zhejiang Provincial Water Resources Department; the data are the same as those on the China Meteorological network.

import requests
import json
import os
from urllib import parse as URL_PARSE

url_search = r"http://typhoon.zjwater.gov.cn/Api/TyphoonSearch/"
url_info   = r"http://typhoon.zjwater.gov.cn/Api/TyphoonInfo/"
output_path= r"./"

def getTYinfo(name):
    # 'name' instead of shadowing the builtin 'str'
    r = requests.get(url_info + name[:6])
    if r.status_code == 200:
        data = r.json()
        savePath = os.path.join(output_path, name + ".json")
        with open(savePath, "w") as f:   # actually write the JSON to disk
            json.dump(data, f, ensure_ascii=False)
        print("Typhoon Info", name, "saved in " + savePath)
    else:
        print("Typhoon Info", name, "download failed.")
    
def main():
    s = input("Input search information: ")
    if not s:
        print("Incorrect input. Quitting.")
        return
    r = requests.get(url_search + URL_PARSE.quote(s, encoding="GBK"))
    info = r.json()
    for i, sg in enumerate(info):
        print(i, sg['name'])
    s = input("Select ids (separate with [Space]): ").split()
    spl = [int(v) for v in s]
    for i in spl:
        try:
            getTYinfo(info[i]['name'])
        except Exception as e:
            print("Skipping", i, ":", e)

if __name__ == "__main__":
    main()

Shapely & Unstructured Grids: Triangulating Randomly Generated Points and Merging

I hacked together a sparsification algorithm for an unstructured grid; when generating structured grid points for interpolation, I ran into the problem of obtaining the grid's boundary, which shapely's union solves. Below is a demonstration with randomly generated points.

import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import MultiPoint
from shapely.ops import triangulate
from descartes.patch import PolygonPatch
from functools import reduce

points_num = 15
x = np.random.rand(points_num)
y = np.random.rand(points_num)
points = MultiPoint(list(zip(x, y)))
triangles = triangulate(points)
plt.figure()
for triangle in triangles:
    patch = PolygonPatch(triangle, alpha=0.5, zorder=2)
    plt.gca().add_patch(patch)
for point in points:
    plt.plot(point.x, point.y, 'o', color='grey')
plt.xlim((0,1))
plt.ylim((0,1))
    
poly = reduce(lambda x, y: x.union(y), triangles)
plt.figure()
plt.gca().add_patch(PolygonPatch(poly, alpha=0.2))
outline = poly.boundary.coords.xy
plt.plot(*outline, 'r')
plt.xlim((0,1))
plt.ylim((0,1))
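As a side note, the reduce over pairwise .union calls works, but shapely.ops.unary_union merges the whole list in one call and is the more common idiom. Since Delaunay triangles tile the convex hull of the input points, the merged polygon should coincide with MultiPoint.convex_hull:

```python
import numpy as np
from shapely.geometry import MultiPoint
from shapely.ops import triangulate, unary_union

np.random.seed(1)
pts = MultiPoint(list(zip(np.random.rand(15), np.random.rand(15))))

merged = unary_union(triangulate(pts))
# The union of Delaunay triangles is the convex hull of the point set
print(merged.symmetric_difference(pts.convex_hull).area)
```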

Image copy from Clipboard

The win32clipboard module provides methods for getting content from the clipboard. But when it comes to image content, raw bytes are returned and are hard to convert to an image type (a bitmap header is involved, and PIL.Image.frombytes() requires a size that is unknown here).

However, the PIL module offers a function that can grab an image from the clipboard.

# from PIL/ImageGrab.py
...
def grabclipboard():
    if sys.platform == "darwin":
        fh, filepath = tempfile.mkstemp('.jpg')
        os.close(fh)
        commands = [
            "set theFile to (open for access POSIX file \""+filepath+"\" with write permission)",
            "try",
                "write (the clipboard as JPEG picture) to theFile",
            "end try",
            "close access theFile"
        ]
        script = ["osascript"]
        for command in commands:
            script += ["-e", command]
        subprocess.call(script)

        im = None
        if os.stat(filepath).st_size != 0:
            im = Image.open(filepath)
            im.load()
        os.unlink(filepath)
        return im
    else:
        data = Image.core.grabclipboard()
        if isinstance(data, bytes):
            from . import BmpImagePlugin
            import io
            return BmpImagePlugin.DibImageFile(io.BytesIO(data))
        return data

So, BmpImagePlugin.DibImageFile() is exactly what we need.

import win32clipboard as w
from PIL import Image, BmpImagePlugin
import io
#
w.OpenClipboard()
if w.IsClipboardFormatAvailable(w.CF_DIB):
    c = w.GetClipboardData(w.CF_DIB)
    im = BmpImagePlugin.DibImageFile(io.BytesIO(c))
    im.show() # or other operations
w.CloseClipboard()

How to Download the NASA Global Precipitation Measurement (GPM) Dataset

1. Prepare an account at https://urs.earthdata.nasa.gov/ and make sure the "NASA GESDISC DATA ARCHIVE" app is in your approved-applications list. Here's the guidance: https://disc.gsfc.nasa.gov/earthdata-login

2. Search for and choose the GPM data you need at https://disc.gsfc.nasa.gov/datasets . Click [Get Data], set the refining options, then press [Get Data] to get the list of download links.

3. The URLs in the download-list file are not directly accessible; you need the account registered earlier. Downloading requires authentication, but plain HTTP auth will fail (because of redirection or other reasons), which means you have to obtain the authorized cookies before requesting the URLs in bulk. As a Pythoner, you can open one download link in a web browser, log in, and then copy the authorized cookies from the request headers.

And the code will look like this:

r = requests.get(url, cookies={'nasa_gesdisc_data_archive': 'xxxxxx'})
with open(filepath, 'wb') as f:
    f.write(r.content)
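For larger granules it is safer to stream the response to disk rather than buffer it whole; a hedged sketch reusing the cookie name from the snippet above (the cookie value is a placeholder, and fetch is just an illustrative name):

```python
import requests

def fetch(url, filepath, cookie_value):
    # Stream the response body to disk in 1 MiB chunks instead of
    # holding the whole granule in memory
    r = requests.get(url,
                     cookies={'nasa_gesdisc_data_archive': cookie_value},
                     stream=True)
    r.raise_for_status()
    with open(filepath, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```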

4. But the download speed of Python's requests module is comparable to a turtle's, so here is a wget method that is much faster (for Mac/Linux).

  1. Create a .netrc file in your home directory to save the login credentials for wget.
    • echo "machine urs.earthdata.nasa.gov login <username> password <password>" >> .netrc
    • chmod 0600 .netrc #(to be safe)
  2. Create a cookie file to record cookies and persist sessions across wget runs. (This can be anywhere, as long as the wget command refers to the correct path.)
    • touch .urs_cookies
  3. Download multiple data files using wget -i:
    • wget --load-cookies .urs_cookies --save-cookies .urs_cookies --auth-no-challenge=on --keep-session-cookies --content-disposition -i <list.txt> #(list.txt contains the URLs of the files to download)

DONE.

Batch-Downloading ASCAT Wind Data with Multiple Threads

FTP won't let me in, so the HTTP interface it is.
Home page: http://www.remss.com/missions/ascat/
Readme: http://data.remss.com/ascat/metopa/readme_ASCAT_metopA.txt
Bytemap-format data: http://data.remss.com/ascat/metopa/bmaps_v02.1/

from bs4 import BeautifulSoup
import requests
import threading
import os
import time

threadslimit = 50
count = 0
count_lock = threading.Lock()  # protect the shared counter across threads

domain = r"http://data.remss.com"
urlprefix = r"/ascat/metopa/bmaps_v02.1/y2018/m"

def download(url):
    global count
    with count_lock:
        count += 1
        id = count
    filename = url.split('/')[-1]
    print("[%03d] Downloading: %s" % (id, filename))
    r = requests.get(url, stream=True)
    with open(os.path.join(os.getcwd(), 'download', filename), 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024*1024):
            if chunk:
                f.write(chunk)
    print("[%03d] download completed." % id)
  
threads = []
for m in range(1, 13):
    url = domain + urlprefix + "%02d" % m
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")
    print(m)
    for item in soup.select("a"):
        if item['href'].endswith('1.gz'):
            downlink = domain + item['href']
            threads.append(threading.Thread(target=download, args=(downlink, )))
    r.close()

print('Threads initialization completed, downloading...')
for t in threads:
    t.start()
    while True:
        if(len(threading.enumerate()) <= threadslimit):
            break
        time.sleep(0.1)
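As a design note, starting all threads and polling threading.enumerate() against a limit works, but concurrent.futures does the throttling itself. A sketch where the download function is passed in as a parameter (run_all is an illustrative name, not part of the script above):

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(links, fetch, workers=50):
    # The pool never runs more than `workers` tasks at once;
    # map() blocks until every task has finished.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fetch, links))
```

With the script above, run_all(threads' download links, download) would replace the manual thread list and the sleep loop.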

A Recycle Bin for Linux Bash

Back when I had just rebuilt our group's server, a senior student inspired me to write a recycle bin. I barely knew shell scripting, but after a day or two of hammering I got one working; a few months later I can barely read what I wrote back then 😛

刚传上 https://github.com/nizekai/recyclebin

The idea is simple: shadow the real rm with an alias, so every deletion moves the file to $HOME/.trash/ while recording the date and original path, and recovery lets the user pick a file to restore. If -f is among the arguments, the script falls back to the real rm instead. Testing went well.

The lazy way to install is to drop it into /etc/profile.d/, then re-source or re-login; rcyclbin enable turns it on and rcyclbin disable turns it off. To have it on by default, put the enable line in ~/.bashrc or /etc/profile.

Example:

[nzk@xxxx ~]$ touch testa testb
[nzk@xxxx ~]$ rm testa testb
Are you sure to delete the files to recyclebin?[y/n]y
testa      delete to recyclebin
testb      delete to recyclebin
[nzk@xxxx ~]$ rr -l
---List for recycle bin---
1:      testa from /home/nzk  as testa_2018-12-14_20:48:45
2:      testb from /home/nzk  as testb_2018-12-14_20:48:45
[nzk@xxxx ~]$ rr 1
Are you sure to recover testa to /home/nzk?[y/n]y
testa has been restored to /home/nzk
[nzk@xxxx ~]$

Code

# PROGRAM:
#   recycle bin   
# Usage: 
#   Use "rcyclbin enable" to enable this function, it will cover the "rm" command.
#       "rcyclbin disable" to disable this of course.
#   It will make a new dir ".trash" in $HOME dir for each user, storing every file that deleted by "rm" command.
#   Use "cleartrash"   to clear the recycle bin.
#   Use "rr -l"        to list the recycle bin.
#   Use "rr <number>"  to recover a file by its listed number.
# History: 
#   2018/08/30  nzk     First release
# to be improved
realrm="/bin/rm"
del () {
    usage () {
        echo "Usage: rm [-fh] file1 [file2 ...]"
        echo "          -h for usage"
        echo "          If '-f' used, then the script will exec "
        echo "          real 'rm' command with full command directly"
        echo "       rr -l to see file in recyclebin"
    }
    local confirm currentTime filename filepath key
    local optflag=0
    if [[ "$#" -eq 0 ]]; then
        usage
        optflag=1
    fi
    if [[ "$1" =~ ^-.*f.*$ ]]; then
        optflag=1
        $realrm "$@"
    fi
    if [[ "$1" == "-h" || "$1" == "--help" ]]; then
        optflag=1
        usage
    fi
    if [[ "$optflag" -eq 0 ]]; then
        read -p "Are you sure to delete the files to recyclebin?[y/n]" confirm
        if [[ "$confirm" == "y" || "$confirm" == "Y" ]]; then 
            if [[ ! -d "$HOME/.trash/" ]]; then
                mkdir ~/.trash
            fi
            if [[ -n "$*" ]]; then
                for file in "$@"; do
                    if [[ -e "$file" ]]; then
                        currentTime=`date +%Y-%m-%d_%H:%M:%S`
                        filename="${file##*/}_${currentTime}"
                        [ -d "$file" ] && filename="$filename/" && file="$file/"
                        filepath=$(cd "$(dirname "$(pwd)/$file")"; pwd)
                        if mv -f "$file" "$HOME/.trash/$filename" ; then
                            [ ! -e "$HOME/.trash/.log" ] && touch ~/.trash/.log
                            printf "%-20s %-20s %-40s\n" "$file" "$filepath" "$filename" >> ~/.trash/.log
                            printf "%-10s delete to recyclebin\n" "$file"
                        fi 
                    fi
                done
            fi
        fi
    fi
    unset -f usage
}

recoveryfile () {
    usage () {
        echo "Usage:     -h for help"
        echo "           -l to list record in recyclebin"
        echo "           append file number that listed to recovery"
    }
    list () {
        if [[ $# -eq 0 ]]; then
            echo "---List for recycle bin---"
            [ -e "$HOME/.trash/.log" ] && awk '{printf "%d: %10s from %-10s as %-20s\n",NR,$1,$2,$3}' ~/.trash/.log 
        else
            echo "Hell is empty"
        fi
    }
    if [[ $# -eq 0 ]]; then
        usage
    fi
    local key
    local optflag=0
    if [[ "$*" =~ ^-.*$ ]]; then
        optflag=1
        while [[ $# -gt 0 ]]; do
            key="$1"
            case $key in
                -h)
                    usage
                    shift
                    ;;
                -l)
                    shift
                    list "$@"
                    shift
                    ;;
                *)
                    shift
            esac
        done
    fi
    if [[ "$optflag" -eq 0 && "$1" =~ ^[0-9]+$  ]]; then
        local file filepath filename ck
        file=$(awk 'NR==n {print $1}' n=$1 $TRASHLOG)
        filepath=$(awk 'NR==n {print $2}' n=$1 $TRASHLOG)
        filename=$(awk 'NR==n {print $3}' n=$1 $TRASHLOG)
        read -p "Are you sure to recover $file to $filepath?[y/n]" ck
        if [[ "$ck" == 'y' || "$ck" == 'Y' ]]; then
            mv -i "$TRASHDIR/$filename" "$filepath/$file" && echo "$file has been restored to $filepath" && sed -i "$1"d "$TRASHLOG"
        fi
    fi 
    unset -f usage
    unset -f list
}

cleartrash () {
    local confirm
    read -p "Are you sure to clear the trash bin?[y/n]" confirm
    [ "$confirm" == 'y' ] || [ "$confirm" == 'Y' ] && /bin/rm -rf $TRASHDIR && mkdir $TRASHDIR && touch $TRASHLOG
}

rcyclbin () {
    usage () {
        echo -e "Usage : rcyclbin enable|disable|clear"
    }
    if [[ -n "$1" ]]; then
        if [[ "$1" == "enable" ]]; then
            alias rm=del
            alias rl='ls ~/.trash'
            alias rr=recoveryfile
            export TRASHDIR=$HOME/.trash
            export TRASHLOG=$TRASHDIR/.log
        elif [[ "$1" == "disable" ]]; then
            unalias rm
            unalias rl
            unalias rr
            export TRASHDIR=
            export TRASHLOG=
        elif [[ "$1" == "clear" ]]; then
            cleartrash
        else
            usage
        fi
    else
        usage
    fi
    unset -f usage
}

Notes: Python, Matplotlib, Basemap

Basemap:

 

Basemap is painfully slow at initializing a map; if you need to draw many figures, the speed is memorable. The workaround is to keep a finished map object in memory (as a backup) and load it back whenever a new figure is needed. Note that deepcopy does not work here; you need the deeper copy done by pickle.dumps and pickle.loads. The code is roughly:

def plotbasemap(self, ...):
    if not self.m:
        self.fig = plt.figure(figsize=(xx, xx))
        map = Basemap(...)
        ...
        self.m = pickle.dumps(map)
        return map
    return pickle.loads(self.m)

Note that this trick also works for quickly saving and loading variables. It takes more space but is convenient; for a space-efficient format, save arrays with numpy as .npy instead.

import pickle
Dataset = ...
pickle.dump(Dataset, open('save.txt', 'wb'))
data = pickle.load(open('save.txt', 'rb'))
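The .npy route mentioned above, sketched:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
np.save('save.npy', a)     # binary .npy, compact for arrays
b = np.load('save.npy')
print((a == b).all())
```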

Data conversion, DictInList -> ListInDict:

Suppose you have data shaped like
[
    {
        'a' : 1,
        'b' : -1
    },
    {
        'a' : 2,
        'b' : -2
    },
    {
        'a' : 3,
        'b' : -3
    }
]

and want to convert it to

{
    'a' : [1, 2, 3], 
    'b' : [-1, -2, -3]
}

 

def dataFormatConvert(data):
    return {key: [p[key] for p in data] for key in data[0].keys()} if isinstance(data, list) else ...
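The reverse conversion (ListInDict -> DictInList) is the same comprehension turned inside out; a quick sketch:

```python
def listInDictToDictInList(data):
    # zip(*values) walks the parallel lists in step, yielding one dict per position
    keys = list(data)
    return [dict(zip(keys, vals)) for vals in zip(*data.values())]

print(listInDictToDictInList({'a': [1, 2, 3], 'b': [-1, -2, -3]}))
```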

 

 

MATLAB Closures

==== file: newCounter.m ====
function [incre] = newCounter(init)
    val = init;
    incre = @counter;
    function result = counter()
        val = val + 1;
        result = val;
    end
end

==== command lines ====
>> f = newCounter(0); 
>> f() 
>> ans =
    1

>> f() 
>> ans =
    2

>> f() 
>> ans =
    3

>> fInfo = functions(f); 
>> fInfo.workspace{1} 
>> ans =
    struct with fields:
        init: 0
        incre: @newCounter/counter
        val: 3

I habitually write piles of anonymous functions in MATLAB, partly because functions cannot be defined inside run scripts, partly to cut down on duplicated code. It does look ugly, though. A few days ago, on a whim, I looked into MATLAB's closure mechanism. Having written JS and Python, I naturally started with code like this:

f = @(a) @(x) x + a;
fCell = cell(10,1);
for i = 1 : 10
    fCell{i} = f(i);
end
==== command lines ====
>> fCell{1}(5) 
>>     6
>> fCell{6}(5) 
>>     11

That looks fine, but writing it like the following also works:

fCell = cell(10,1);
for i = 1 : 10
    fCell{i} = @(x) x + i;
end
==== command lines ====
>> fCell{1}(5) 
>>     6
>> fCell{6}(5) 
>>     11

The second version is perfectly fine, because when the anonymous function is packaged, the external variables used in its expression are frozen into a workspace. Consider the following example and its function info:

i = 2;
j = 3;
f = @(x) x * i + j;
f(10)
i = 4;
f(10)
==== command lines ====
>> ans =
    23

>> ans =
    23

>> fInfo = functions(f) 
>> fInfo =
    struct with fields:
    function: '@(x)x*i+j'
    type: 'anonymous'
    file: 'C:\Users\nizek\Desktop\matlab PRO\Untitled.m'
    workspace: {[1×1 struct]}
    within_file_path: 'Untitled'

>> fInfo.workspace{1} 
>> ans =
    struct with fields:
        i: 2
        j: 3

As you can see, once the anonymous function is packaged, changing the external variable does not change the value captured inside it, and there is no way to change it at all (for nested functions there is a way; see below).
The newCounter function at the top of this post encapsulates such a counter: every call increments it by one.
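For contrast: Python (like JS) closures capture variables by reference, so the same loop there hits the classic late-binding pitfall instead; a quick sketch:

```python
# Each lambda closes over the loop variable itself, so all of them
# see its final value (i == 10) -- unlike MATLAB's capture-by-value
fs = [lambda x: x + i for i in range(1, 11)]
print(fs[0](5), fs[5](5))   # 15 15

# A default argument freezes the current value, restoring the
# MATLAB-like behaviour
gs = [lambda x, a=i: x + a for i in range(1, 11)]
print(gs[0](5), gs[5](5))   # 6 11
```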

What if you want to change the captured value midway? You can hook out a setter as a function handle:

==== file: newCounter.m ====
function [incre, setter] = newCounter(init)
    val = init;
    %clear init;
    incre = @counter;
    setter = @setVal;
    function result = counter()
        val = val + 1;
        result = val;
    end
    function [] = setVal(n)
        val = n;
    end
end

==== command lines ====
>> [f, setter] = newCounter(0); 
>> f() 
>> ans =
    1

>> f() 
>> ans =
    2
>> f() 
>> ans =
    3
>> setter(100) 
>> f() 
>> ans =
    101

Of course that looks ugly, so let's prettify it:

==== file: newCounter.m ====
function [clss] = newCounter(varargin)
    try, val = varargin{1}; catch, val = 0; end
    %clear varargin;
    clss = struct('incre',@counter,'setter',@setVal);
    function result = counter()
        val = val + 1;
        result = val;
    end
    function [] = setVal(n)
        val = n;
    end
end

==== command lines ====
>> a=newCounter(); 
>> a.incre() 
>> ans =
    1

>> a.incre() 
>> ans =
    2

>> a.setter(10) 
>> a.incre() 
>> ans =
    11

>> a.incre() 
>> ans =
    12

Efficiency:
In the same environment, creating 100,000 counters takes:
function-based counter with struct: 3.97 s
function-based counter without struct: 2.45 s
a counter implemented as a class: 0.70 s
Calling them, though, is equally fast.

Summary:
Variable scoping in MATLAB is strict: packaging an anonymous function bundles the referenced variables into an isolated workspace, and when an anonymous function generates another anonymous function, an extra (empty) workspace layer is even inserted.
Of course, for large projects with heavy computation, writing a proper class is more efficient at runtime; encapsulating state in functions is not recommended, and dropping the struct wrapper also makes things faster.

Passing Arrays via GET in Javascript

Passing an array via GET is admittedly a bit crude, but I recently ran into a multi-select option filter where POST was a poor fit.
I wanted the filter state restored from the URL when returning to the page, so the array has to be read back in JS.
The GET array form is: xxx?array[]=value1&array[]=value2...
I took a GetRequest function found online a while ago and modified it to support array values.

function GetRequest() {
    var url = location.search;
    var theRequest = new Object();
    if (url.indexOf("?") != -1) {
        var str = url.substr(1);
        var strs = str.split("&");
        for (var i = 0; i < strs.length; i++) {
            var key_value = strs[i].split("=").map(decodeURI).map(unescape);
            if (key_value[0].indexOf('[') != -1) {
                // strip the trailing "[]" and collect the values into an array
                var key = key_value[0].match(/(\S*)\[\]/)[1];
                (theRequest[key] = theRequest[key] || new Array()).push(key_value[1]);
            } else {
                theRequest[key_value[0]] = key_value[1];
            }
        }
    }
    return theRequest;
}
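For comparison, Python's standard library parses the same query-string shape out of the box; note that parse_qs keeps the literal "arr[]" key instead of stripping the brackets the way the function above does:

```python
from urllib.parse import parse_qs

q = parse_qs("a=1&arr[]=x&arr[]=y")
print(q)   # values always come back as lists
```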

 

Test:

> GetRequest()
< {a: "2==1", b: ["1[3]@&^嘻嘻@#", "1W、\\=F@哈哈&%*^[]./"]}