CHAPTER 4
Bad Data Lurking in Plain Text
Josh Levy, PhD

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
—Doug McIlroy

Bad data is often delivered with a warning or an apology such as, “This dump is a real mess, but maybe you’ll find something there.” Some bad data comes with a more vacuous label: “This is plain text, tab-delimited. It won’t give you any trouble.”

In this article, I’ll present data problems I’ve encountered while performing seemingly simple analysis of data stored in plain text files and the strategies I’ve used to get past the problems and back to work. The problems I’ll discuss are:

1. Unknown character encoding
2. Misrepresented character encoding
3. Application-specific characters leaking into plain text

I’ll use snippets of Python code to illustrate these problems and their solutions. My demo programs will run against a stock install of Python 2.7.2 without any additional requirements. There are, however, many excellent Open Source libraries for text processing in Python. Towards the end of the article, I’ll survey a few of my favorites. I’ll conclude with a set of exercises that the reader can perform on publicly available data.

Which Plain Text Encoding?

McIlroy’s advice above is incredibly powerful, but it must be taken with a word of caution: some text streams are more universal than others. A text encoding is the mapping between the characters that can occur in a plain text file and the numbers computers use to represent them. A program that joins data from multiple sources may misbehave if its inputs were written using different text encodings. This is a problem I encountered while matching names listed in plain text files.

My client had several lists of names that I received in plain text files. Some lists contained names of people with whom the client conducted business; others contained the names of known bad actors with whom businesses are forbidden from transacting. The lists were provided as-is, with little or no accompanying documentation. The project was part of an audit to determine which, if any, of the client’s partners were on the bad actors lists. The matching software that I wrote was only a part of the solution. The suspected matches it identified were then sent to a team of