[python] urlsplit example

python 2018.09.12 13:12



To parse a URL in Python, use the urlsplit function from the urllib.parse module. urlsplit splits a URL into its components.




>>> from urllib.parse import urlsplit


>>> components = urlsplit('http://example.webscraping.com/places/default/view')


>>> print(components)

SplitResult(scheme='http', netloc='example.webscraping.com', path='/places/default/view', query='', fragment='')


>>> print(components.path)

/places/default/view
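
The remaining fields of the result can be read the same way. The following is a minimal sketch; the URL with a port, query string, and fragment is made up for illustration and is not from the original example.

```python
from urllib.parse import urlsplit

# Hypothetical URL: the port, query, and fragment are added for illustration.
parts = urlsplit('http://example.webscraping.com:8080/places/default/view?page=2#top')

print(parts.scheme)    # http
print(parts.netloc)    # example.webscraping.com:8080
print(parts.hostname)  # example.webscraping.com
print(parts.port)      # 8080
print(parts.path)      # /places/default/view
print(parts.query)     # page=2
print(parts.fragment)  # top

# geturl() reassembles the components into a URL string.
rebuilt = parts.geturl()
print(rebuilt)
```

Note that netloc keeps the port, while hostname and port expose the two pieces separately.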





Posted by 김용환 '김용환'


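This is an example of Python module/package programming that uses __init__.py.

Since Python 3.3, a package works even without __init__.py, but the file is kept here for compatibility.


The concepts are explained well on the Python study site.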

http://pythonstudy.xyz/python/article/18-%ED%8C%A8%ED%82%A4%EC%A7%80



Let's build a simple example.


$ mkdir -p module


$ touch module/__init__.py



Add a module named xxx.



$ cat > module/xxx.py

def echo():

    print("echo")



$ ls -al module/

-rw-r--r--  1 samuel.kim  staff    0  9  7 12:30 __init__.py

-rw-r--r--  1 samuel.kim  staff   27  9  7 12:31 xxx.py



Now call echo() from xxx.py in the python interpreter.


$ python

Python 3.6.2 (default, Sep  5 2017, 15:21:12)

[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> from module.xxx import echo

>>> echo

<function echo at 0x10c25d268>

>>> echo ()

echo

>>> echo()

echo


It works fine.





This time, yyy.py calls echo() from xxx.py.


$ cat > module/yyy.py


from module.xxx import echo


def test():

    echo()

    print("Test")



>>> from module.yyy import test

>>> test()

echo

Test
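
The whole walkthrough can also be reproduced from a single script. This is only a sketch: the package name demo_pkg and the temporary directory are assumptions made for illustration, and the functions return strings instead of printing so the result is easy to check.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package named demo_pkg (hypothetical name) in a temp directory.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, 'demo_pkg')
os.makedirs(pkg_dir)

# An empty __init__.py marks the directory as a regular package.
open(os.path.join(pkg_dir, '__init__.py'), 'w').close()

with open(os.path.join(pkg_dir, 'xxx.py'), 'w') as f:
    f.write('def echo():\n    return "echo"\n')

# yyy.py imports echo from its sibling module through the package name.
with open(os.path.join(pkg_dir, 'yyy.py'), 'w') as f:
    f.write('from demo_pkg.xxx import echo\n\n'
            'def test():\n    return echo() + " Test"\n')

# Make the parent directory importable, then import the submodule.
sys.path.insert(0, root)
importlib.invalidate_caches()
yyy = importlib.import_module('demo_pkg.yyy')
result = yyy.test()
print(result)  # echo Test
```

The import works because the parent directory of the package is on sys.path, which is the same reason the interpreter session above has to be started from the directory that contains module/.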




To download https://github.com/kjam/wswp and run its code,

start python under the code directory and run the Python module as follows.



>>> from chp1.advanced_link_crawler import link_crawler

>>> start_url = 'http://example.webscraping.com/index'

>>> link_regex = '/(index|view)'

>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')

Downloading: http://example.webscraping.com/index






[python] whois module

python 2018.09.03 14:24


With Python's whois module, you can get the same results as a whois web lookup.




$ pip install python-whois



$ python

Python 3.7.0 (default, Sep  3 2018, 12:00:39)

[Clang 7.3.0 (clang-703.0.31)] on darwin

Type "help", "copyright", "credits" or "license" for more information.




>>> import whois

>>> print(whois.whois('appspot.com'))


{

  "domain_name": [

    "APPSPOT.COM",

    "appspot.com"

  ],

  "registrar": "MarkMonitor, Inc.",

  "whois_server": "whois.markmonitor.com",

  "referral_url": null,

  "updated_date": [

    "2018-02-06 10:30:28",

    "2018-02-06 02:30:29-08:00"

  ],

  "creation_date": [

    "2005-03-10 02:27:55",

    "2005-03-09 18:27:55-08:00"

  ],

  "expiration_date": [

    "2019-03-10 01:27:55",

    "2019-03-09 00:00:00-08:00"

  ],

  "name_servers": [

    "NS1.GOOGLE.COM",

    "NS2.GOOGLE.COM",

    "NS3.GOOGLE.COM",

    "NS4.GOOGLE.COM",

    "ns1.google.com",

    "ns2.google.com",

    "ns4.google.com",

    "ns3.google.com"

  ],

  "status": [

    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",

    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",

    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",

    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",

    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",

    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",

    "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",

    "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",

    "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",

    "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",

    "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",

    "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"

  ],

  "emails": [

    "abusecomplaints@markmonitor.com",

    "whoisrelay@markmonitor.com"

  ],

  "dnssec": "unsigned",

  "name": null,

  "org": "Google LLC",

  "address": null,

  "city": null,

  "state": "CA",

  "zipcode": null,

  "country": "US"

}

>>> print(whois.whois('naver.com'))

{

  "domain_name": [

    "NAVER.COM",

    "naver.com"

  ],

  "registrar": "Gabia, Inc.",

  "whois_server": "whois.gabia.com",

  "referral_url": null,

  "updated_date": [

    "2016-08-05 06:37:57",

    "2018-02-28 11:27:15"

  ],

  "creation_date": [

    "1997-09-12 04:00:00",

    "1997-09-12 00:00:00"

  ],

  "expiration_date": [

    "2023-09-11 04:00:00",

    "2023-09-11 00:00:00"

  ],

  "name_servers": [

    "NS1.NAVER.COM",

    "NS2.NAVER.COM",

    "ns1.naver.com",

    "ns2.naver.com"

  ],

  "status": [

    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",

    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",

    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",

    "ok https://icann.org/epp#ok"

  ],

  "emails": [

    "white.4818@navercorp.com",

    "dl_ssl@navercorp.com",

    "abuse@gabia.com"

  ],

  "dnssec": "unsigned",

  "name": "NAVER Corp.",

  "org": "NAVER Corp.",

  "address": "6 Buljung-ro, Bundang-gu, Seongnam-si, Gyeonggi-do, 463-867, Korea",

  "city": "Gyeonggi",

  "state": null,

  "zipcode": "463463",

  "country": "KR"

}

>>> print(whois.whois('abc.com'))

{

  "domain_name": [

    "ABC.COM",

    "abc.com"

  ],

  "registrar": "CSC CORPORATE DOMAINS, INC.",

  "whois_server": "whois.corporatedomains.com",

  "referral_url": null,

  "updated_date": [

    "2018-08-08 23:38:25",

    "2018-08-08 17:11:02"

  ],

  "creation_date": "1996-05-22 04:00:00",

  "expiration_date": "2019-05-23 04:00:00",

  "name_servers": [

    "ORNS01.DIG.COM",

    "ORNS02.DIG.COM",

    "SENS01.DIG.COM",

    "SENS02.DIG.COM",

    "orns02.dig.com",

    "orns01.dig.com",

    "sens02.dig.com",

    "sens01.dig.com"

  ],

  "status": [

    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",

    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",

    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",

    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",

    "clientTransferProhibited http://www.icann.org/epp#clientTransferProhibited",

    "serverDeleteProhibited http://www.icann.org/epp#serverDeleteProhibited",

    "serverTransferProhibited http://www.icann.org/epp#serverTransferProhibited"

  ],

  "emails": [

    "domainabuse@cscglobal.com",

    "Corp.DNS.Domains@disney.com"

  ],

  "dnssec": "unsigned",

  "name": "ABC, Inc.; Domain Administrator",

  "org": "ABC, Inc.",

  "address": "77 West 66th Street",

  "city": "New York",

  "state": "NY",

  "zipcode": "10023-6298",

  "country": "US"

}
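
In the outputs above, a field such as creation_date is a list for appspot.com and naver.com but a single value for abc.com. Code that consumes these results may want to normalize that. The helper below is a hypothetical sketch, not part of the whois module, and the sample values are copied from the printed output as plain strings (the library itself may return other types).

```python
def first_value(value):
    """Return the first item if value is a list or tuple, otherwise value itself."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


# Sample values copied from the printed results above, as plain strings.
appspot_created = ["2005-03-10 02:27:55", "2005-03-09 18:27:55-08:00"]
abc_created = "1996-05-22 04:00:00"

print(first_value(appspot_created))  # 2005-03-10 02:27:55
print(first_value(abc_created))      # 1996-05-22 04:00:00
print(first_value(None))             # None
```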






This example takes an option (parameter) in Python and handles any whitespace it contains.



from optparse import OptionParser

import re


parser = OptionParser()

parser.add_option("--exclude_host", help="excluded host", type="string", default='')

..

exclude_host = re.split(r"^\s+|\s*,\s*|\s+$", options.exclude_host)




To apply a subtraction such as A - B to lists, see the following example.


..

if options.exclude_host != '':

    fqdn_list = [item for item in fqdn_list if item not in exclude_host ]
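
The two fragments above can be combined into one runnable sketch. Two things here are assumptions for illustration: the argument list passed to parse_args and the contents of fqdn_list. The sketch also swaps the regex split for a plain split(",") plus strip(), because the regex version can leave empty strings at the ends of the list when the value has leading or trailing whitespace.

```python
from optparse import OptionParser

parser = OptionParser()
parser.add_option("--exclude_host", help="excluded host", type="string", default='')

# Pass an explicit argument list instead of reading sys.argv (illustration only).
options, args = parser.parse_args(["--exclude_host", " a.example.com , b.example.com "])

# Split on commas, strip surrounding whitespace, and drop empty entries.
exclude_host = [h.strip() for h in options.exclude_host.split(",") if h.strip()]

# Hypothetical host list; A - B keeps the items of A that are not in B.
fqdn_list = ["a.example.com", "b.example.com", "c.example.com"]
if exclude_host:
    fqdn_list = [item for item in fqdn_list if item not in exclude_host]

print(exclude_host)  # ['a.example.com', 'b.example.com']
print(fqdn_list)     # ['c.example.com']
```

For long lists, turning exclude_host into a set would make the membership test cheaper while keeping the order of fqdn_list.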




pytz has a bit of a bug.



By the way, a familiar name, 정상혁, shows up here.

https://github.com/stub42/pytz/blob/master/tz/asia#L1928

(For reference, a related article on daylight saving time: https://d2.naver.com/helloworld/645609)



>>> import pytz

>>> from datetime import datetime

>>> fmt = '%Y-%m-%d %H:%M:%S %Z%z'

>>> seoul = pytz.timezone('Asia/Seoul')

>>> seoul

<DstTzInfo 'Asia/Seoul' LMT+8:28:00 STD>

>>> seoul_dt = seoul.localize(datetime(2018, 6, 19, 17, 53))

>>> seoul_dt.strftime(fmt)

'2018-06-19 17:53:00 KST+0900'


I filed an issue about it.


https://github.com/stub42/pytz/issues/15


Hi!

I found a time zone issue related to the recent change of the Pyongyang (North Korea) time zone.
According to 'https://en.wikipedia.org/wiki/Time_in_North_Korea', 'On 29 April 2018, North Korean leader Kim Jong-un announced his country would be returning to UTC+9 to realign its clocks with South Korea.' This is based on a report in the Guardian ('https://www.theguardian.com/world/2018/may/05/time-for-change-north-korea-moves-clocks-forward-to-match-south').

The code below does not match the Wikipedia article.

import pytz
import datetime

def main():
	
	seoul = pytz.timezone('Asia/Seoul')
	print(seoul.localize(datetime.datetime.now()))
	
	pyongyang = pytz.timezone('Asia/Pyongyang')
	print(pyongyang.localize(datetime.datetime.now()))
	
if __name__ == '__main__':
	main()

The result is below.

2018-06-19 18:23:36.818206+09:00
2018-06-19 18:23:36.818469+08:30

Second result should be equal to '2018-06-19 18:23:36.818469+09:00'

Could you change the code and the document (https://github.com/stub42/pytz/blob/master/tz/asia#L1997)?

While testing the previous example, I found another interesting result.

>>> import pytz

>>> from datetime import datetime

>>> fmt = '%Y-%m-%d %H:%M:%S %Z%z'

>>> seoul = pytz.timezone('Asia/Seoul')

>>> seoul

<DstTzInfo 'Asia/Seoul' KST+8:30:00 STD>
>>> pyongyang = pytz.timezone('Asia/Pyongyang')
>>> pyongyang

<DstTzInfo 'Asia/Pyongyang' KST+8:30:00 STD>

Finally, I found another relevant document, which describes world time zones. As mentioned above, I think it should be changed.

https://github.com/stub42/pytz/blob/master/tz/asia#L47
https://github.com/stub42/pytz/blob/master/tz/asia#L50
-> I think the entry '8:30 KST KDT Korea when at +0830' should be removed and the entry '9:00 KST KDT Korea when at +09' should be kept.

Thanks in advance.








A simple code example


import requests


def main():
    print('Hello, world!')
    response = requests.get('https://httpbin.org/ip')
    print(response.status_code)
    print(response.headers)
    print('Your IP is {0}'.format(response.json()['origin']))

if __name__ == '__main__':
    main()



The result is as follows.


Hello, world!

200

{'Connection': 'keep-alive', 'Server': 'gunicorn/19.8.1', 'Date': 'Mon, 04 Jun 2018 02:28:09 GMT', 'Content-Type': 'application/json', 'Content-Length': '26', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}

Your IP is 1.1.1.1





This code uses HTTPAdapter.


from requests import Session
from requests.adapters import HTTPAdapter


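def main():
    print('Hello, world!')

    session = Session()
    # Mount on https:// so the adapter (and its retries) applies to the URL requested below.
    session.mount("https://", HTTPAdapter(max_retries=3))
    # Use a positive timeout in seconds; current urllib3 rejects values <= 0.
    response = session.get('https://httpbin.org/ip', timeout=5)

    print(response.status_code)
    print(response.headers)
    print('Your IP is {0}'.format(response.json()['origin']))

if __name__ == '__main__':
    main()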




The result is the same.




from requests import Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def main():
    print('Hello, world!')
    retries_number = 3
    backoff_factor = 0.3
    status_forcelist = (500, 400)

    # Retry policy: up to 3 retries with exponential backoff, also on the listed status codes.
    retry = Retry(
        total=retries_number,
        read=retries_number,
        connect=retries_number,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    session = Session()
    # Mount on https:// so the retry policy applies to the URL requested below.
    session.mount("https://", HTTPAdapter(max_retries=retry))
    # Use a positive timeout in seconds; current urllib3 rejects values <= 0.
    response = session.get('https://httpbin.org/ip', timeout=5)

    print(response.status_code)
    print(response.headers)
    print('Your IP is {0}'.format(response.json()['origin']))

if __name__ == '__main__':
    main()



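According to the formula below, the sleep times work out as follows. The total sleep time is 1.8 seconds...


{backoff factor} * (2 ^ ({number of total retries} - 1))



retry 1: 0 (the formula gives 0.3 * (2 ^ (1 - 1)) = 0.3, but the first retry runs without a delay)

retry 2: 0.3 * (2 ^ (2 - 1)) = 0.6

retry 3: 0.3 * (2 ^ (3 - 1)) = 1.2



0 + 0.6 + 1.2 = 1.8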
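
The same arithmetic can be checked with a few lines of Python. This is a sketch of the formula as quoted in this post, not urllib3's own implementation; the function name backoff_sleeps is made up for illustration, and the cap at Retry.BACKOFF_MAX is ignored.

```python
def backoff_sleeps(backoff_factor, retries):
    """Sleep time before each retry, following the formula quoted in this post."""
    sleeps = []
    for n in range(1, retries + 1):
        if n == 1:
            # The first retry happens immediately, without a delay.
            sleeps.append(0.0)
        else:
            sleeps.append(backoff_factor * (2 ** (n - 1)))
    return sleeps


sleeps = backoff_sleeps(0.3, 3)
total = round(sum(sleeps), 2)
print(sleeps)  # [0.0, 0.6, 1.2]
print(total)   # 1.8
```

With backoff_factor=0.1 the same function yields 0.0, 0.2, 0.4, which matches the sequence given in the urllib3 documentation.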






https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#module-urllib3.util.retry


backoff_factor (float) –

A backoff factor to apply between attempts after the second try (most errors are resolved immediately by a second try without a delay). urllib3 will sleep for:

{backoff factor} * (2 ^ ({number of total retries} - 1))

seconds. If the backoff_factor is 0.1, then sleep() will sleep for [0.0s, 0.2s, 0.4s, …] between retries. It will never be longer than Retry.BACKOFF_MAX.

By default, backoff is disabled (set to 0).






If a timeout occurs, a backoff pause kicks in between retries. Combining retries and timeout well should be good enough.


response = session.get('https://httpbin.org/ip', timeout=5)



Python has an unusual try-else syntax, so here is an example that explores it.


If the except clause does not run, the else clause runs.

a=0
try:
    a=1
except ZeroDivisionError as e:
    print(str(e))
else:
    print(a)


The result is 1.





The following code deliberately divides by zero to raise ZeroDivisionError.


If the except clause runs, the else clause does not.

a=0
try:
    a = 4/0
except ZeroDivisionError as e:
    print(str(e))
else:
    print(a)


The result is as follows.


division by zero
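
Both cases can be put into one function to see the two paths side by side. This is a small sketch; the function name safe_divide is made up for illustration.

```python
def safe_divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError as e:
        # Runs only when the division raised ZeroDivisionError.
        return 'error: {0}'.format(e)
    else:
        # Runs only when the try block raised no exception.
        return 'ok: {0}'.format(result)


print(safe_divide(4, 2))  # ok: 2.0
print(safe_divide(4, 0))  # error: division by zero
```

Keeping the success path in else, rather than inside try, means an exception raised by that code is not accidentally caught by the except clause.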






I heard that Python has an unusual for-else syntax, so I took a look.


Here is an example that runs for - else.


data = [1, 2, 3, 4, 5]
for i in data:
    print(i)
else:
    print("aa")

print("end")


The result is as follows.


1

2

3

4

5

aa

end






Why is it needed? The syntax is tied to break.



If the loop hits a break statement and leaves the for loop, the else block is not executed.


data = [1, 2, 3, 4, 5]
for i in data:
    print(i)
    if i == 3:
        break
else:
    print("aa")

print("end")




The result is as follows.




1

2

3

end
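
A typical use of this is a search loop, where else handles the "not found" case. The function below is a sketch with a made-up name, find_first_even.

```python
def find_first_even(numbers):
    for n in numbers:
        if n % 2 == 0:
            found = n
            break
    else:
        # Reached only when the loop finished without hitting break.
        found = None
    return found


print(find_first_even([1, 3, 4, 5]))  # 4
print(find_first_even([1, 3, 5]))     # None
```

Without for-else, the same logic usually needs a separate flag variable to remember whether break was hit.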






I learned the following while upgrading kazoo, which integrates with zookeeper, to python3.




In python2, the byte string prefix was simply ignored.



A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.



In python3, however, byte strings are written with a b or B prefix.


Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.





So when storing, a string is converted with encode(),


 value.encode()


and when reading, it is converted back with decode().


value.decode()
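
A round trip shows the two types involved. This sketch uses only the standard library and does not talk to zookeeper; the sample text is made up for illustration.

```python
text = 'kazoo 예제'

# str -> bytes (UTF-8 is the default encoding in Python 3).
data = text.encode()

# bytes -> str
restored = data.decode()

print(type(data).__name__)      # bytes
print(type(restored).__name__)  # str
print(restored == text)         # True

# In Python 3 a bytes literal and a str literal are never equal.
print(b'abc' == 'abc')          # False
```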









In python3,


to resolve jinja2.exceptions.UndefinedError: 'len' is undefined,


use '|length'.


{% if node.data|length == 0 %}

