Text Processing in Python
By David Mertz

Publisher: Addison Wesley
Pub Date: June 06, 2003
ISBN: 0-321-11254-7
Pages: 544

Text Processing in Python is an example-driven, hands-on tutorial that carefully teaches programmers how to accomplish numerous text processing tasks using the Python language. Filled with concrete examples, this book provides efficient and effective solutions to specific text processing problems and practical strategies for dealing with all types of text processing challenges.

Text Processing in Python begins with an introduction to text processing and contains a quick Python tutorial to get you up to speed. It then delves into essential text processing subject areas, including string operations, regular expressions, parsers and state machines, and Internet tools and techniques. Appendixes cover such important topics as data compression and Unicode. A comprehensive index and plentiful cross-referencing offer easy access to available information. In addition, exercises throughout the book provide readers with further opportunity to hone their skills either on their own or in the classroom. A companion Web site (http://gnosis.cx/TPiP) contains source code and examples from the book.

Here is some of what you will find in this book:

• When do I use formal parsers to process structured and semi-structured data?
• How do I work with full text indexing?
• What patterns in text can be expressed using regular expressions?
• How do I find a URL or an email address in text?
• How do I process a report with a concrete state machine?
• How do I parse, create, and manipulate Internet formats?
• How do I handle lossless and lossy compression?
• How do I find codepoints in Unicode?

Table of Contents

Preface
    Section 0.1. What Is Text Processing?
    Section 0.2. The Philosophy of Text Processing
    Section 0.3. What You'll Need to Use This Book
    Section 0.4. Conventions Used in This Book
    Section 0.5. A Word on Source Code Examples
    Section 0.6. External Resources
Acknowledgments
Chapter 1. Python Basics
    Section 1.1. Techniques and Patterns
    Section 1.2. Standard Modules
    Section 1.3. Other Modules in the Standard Library
Chapter 2. Basic String Operations
    Section 2.1. Some Common Tasks
    Section 2.2. Standard Modules
    Section 2.3. Solving Problems
Chapter 3. Regular Expressions
    Section 3.1. A Regular Expression Tutorial
    Section 3.2. Some Common Tasks
    Section 4.3. Parser Libraries for Python
Chapter 5. Internet Tools and Techniques
    Section 5.1. Working with Email and Newsgroups
    Section 5.3. Synopses of Other Internet Modules
    Section A.5. Functional Programming
Appendix B. A Data Compression Primer
    Section B.1. Introduction
    Section B.3. A Data Set Example
    Section B.4. Whitespace Compression
    Section B.9. A Custom Text Compressor
    Section B.10. References
Appendix C. Understanding Unicode
    Section C.1. Some Background on Characters
    Section C.2. What Is Unicode?
    Section C.4. Declarations
    Section C.5. Finding Codepoints
Appendix D. A State Machine for Adding Markup to Text

Copyright

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of the trademark claim, the designations have been printed in initial capital letters or all capital letters.

The author and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

The publisher offers discounts on this book when ordered in quantity for bulk purchases and special sales. For more information, please contact:

    U.S. Corporate and Government Sales
    corpsales@pearsontechgroup.com

For sales outside of the U.S., please contact:

    International Sales
    international@pearsontechgroup.com

Visit Addison-Wesley on the Web: www.awprofessional.com

Library of Congress Cataloging-in-Publication Data

    Mertz, David.
    Text processing in Python / David Mertz.
    p. cm.
    Includes bibliographical references and index.
    ISBN 0-321-11254-7 (alk. paper)
    1. Text processing (Computer science) 2. Python (Computer program language) I. Title.
    QA76.9.T48M47 2003
    005.13'3

    >>> 13/7     # integer division
    1
    >>> 13/7.    # float division
    1.8571428571428572

In documentation of module functions, where named arguments are available, they are listed with their default value. Optional arguments are listed in square brackets. These conventions are also used in the Python Library Reference. For example:

    foobar.spam(s, val=23 [,taste="spicy"])

The function foobar.spam() uses the argument s to ...

If a named argument does not have a specifiable default value, the argument is listed followed by an equal sign and ellipsis. For example:

    foobar.baz(string=..., maxlen=...)

The foobar.baz() function ...

With the introduction of Unicode support to Python, an equivalence between a character and a byte no longer holds in all cases. Where an operation takes a numeric argument affecting a string-like object, the documentation will specify whether characters or bytes are being counted. For example:

    Operation A reads num bytes from the buffer.
    Operation B reads num characters from the buffer.

The first operation indicates a number of actual 8-bit bytes affected. The second operation indicates that an indefinite number of bytes are affected, but that they compose a number of (maybe multibyte) characters.

0.5 A Word on Source Code Examples

First things first. All the source code in this book is hereby released to the public domain. You can use it however you like, without restriction. You can include it in free software, or in commercial/proprietary projects.
Change it to your heart's content, and in any manner you want. If you feel like giving credit to the author (or sending him large checks) for code you find useful, that is fine, but no obligation to do so exists.

All the source code in this book, and various other public domain examples, can be found at the book's Web site. If such an electronic form is more convenient for you, we hope this helps you. In fact, if you are able, you might benefit from visiting this location, where you might find updated versions of examples or other useful utilities not mentioned in the book.

First things out of the way, let us turn to second things. Little of the source code in this book is intended as a final say on how to perform a given task. Many of the examples are easy enough to copy directly into your own program, or to use as standalone utilities. But the real goal in presenting the examples is educational. We really hope you will think about what the examples do, and why they do it the way they do. In fact, we hope readers will think of better, faster, and more general ways of performing the same tasks. If the examples work their best, they should be better as inspirations than as instructions.

0.6 External Resources

0.6.1 General Resources

A good clearinghouse for resources and links related to this book is the book's Web site. Over time, I will add errata and additional examples, questions, answers, utilities, and so on to the site, so check it from time to time:

    http://gnosis.cx/TPiP/

The first place you should probably turn for any question on Python programming (after this book), is:

The Python newsgroup is an amazingly useful resource, with discussion that is generally both friendly and erudite. You may also post to and follow the newsgroup via a mirrored mailing list:

    http://mail.python.org/mailman/listinfo/python-list

0.6.2 Books

This book generally aims at an intermediate reader. Other Python books are better introductory texts (especially for those fairly new to programming generally).
Some good introductory texts are:

    Core Python Programming, Wesley J. Chun, Prentice Hall, 2001. ISBN: 0-130-26036-3.

    Learning Python, Mark Lutz & David Ascher, O'Reilly, 1999. ISBN: 1-56592-464-9.

    The Quick Python Book, Daryl Harms & Kenneth McDonald, Manning, 2000. ISBN: 1-884777-74-0.

As introductions, I would generally recommend these books in the order listed, but learning styles vary between readers.

Two texts that overlap this book somewhat, but focus more narrowly on referencing the standard library, are:

    Python Essential Reference, Second Edition, David M. Beazley, New Riders, 2001. ISBN: 0-7357-1091-0.

    Python Standard Library, Fredrik Lundh, O'Reilly, 2001. ISBN: 0-596-00096-0.

For coverage of XML, at a far more detailed level than this book has room for, there is the excellent text:

    Python & XML, Christopher A. Jones & Fred L. Drake, Jr., O'Reilly, 2002. ISBN: 0-596-00128-2.

0.6.3 Software Directories

Currently, the best Python-specific directory for software is the Vaults of Parnassus:

    http://www.vex.net/parnassus/

SourceForge is a general open source software resource. Many projects (Python and otherwise) are hosted at that site, and the site provides search capabilities, keywords, category browsing, and the like:

    http://sourceforge.net/

Freshmeat is another widely used directory of software projects (mostly open source). Like the Vaults of Parnassus, Freshmeat does not directly host project files, but simply acts as an information clearinghouse for finding relevant projects:

    http://freshmeat.net/

A number of Python projects are discussed in this book. Most of those are listed in one or more of the software directories mentioned above. A general search engine like Google, operator.

Tim Andrews
Advice on carving explanations of compression techniques.
Roland Gerlach
Typos are boundless, but a bit less for his email.

Antonio Cun

    for line in fp.readlines():    # Py2.2 -> "for line in fp:"
        if isCond(line):           # (2.2 version reads lazily)
            selected.append(line)
    del line                       # Cleanup transient variable

There is nothing wrong with these few lines (see xreadlines on efficiency issues). But it does take a few seconds to read through them. In my opinion, even this small block of lines does not parse as a single thought, even though its operation really is such. Also, the variable line is slightly superfluous (it retains a value as a side effect after the loop, and could conceivably step on a previously defined value). In FP style, we could write the simpler:

    selected = filter(isCond, open(filename).readlines())
    # Py2.2 -> filter(isCond, open(filename))

In the concrete, a textual source that one frequently wants to process as a list of lines is a log file. All sorts of applications produce log files, most typically either ones that cause system changes that might need to be examined or long-running applications that perform actions intermittently. For example, the PythonLabs Windows installer for Python 2.2 produces a file called INSTALL.LOG that contains a list of actions taken during the install. Below is a highly abridged copy of this file from one of my computers:

INSTALL.LOG sample data file

    Title: Python 2.2
    Source: C:\DOWNLOAD\PYTHON-2.2.EXE | 02-23-2002 | 01:40:54 | 7074248
    Made Dir: D:\Python22
    File Copy: D:\Python22\UNWISE.EXE | 05-24-2001 | 12:59:30 | |
    RegDB Key: Software\Microsoft\Windows\CurrentVersion\Uninstall\Py...
    RegDB Val: Python 2.2
    File Copy: D:\Python22\w9xpopen.exe | 12-21-2001 | 12:22:34 | |
    Made Dir: D:\PYTHON22\DLLs
    File Overwrite: C:\WINDOWS\SYSTEM\MSVCRT.DLL | | | | 286000 | 770c8856
    RegDB Root: 2
    RegDB Key: Software\Microsoft\Windows\CurrentVersion\App Paths\Py...
    RegDB Val: D:\PYTHON22\Python.exe
    Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Uninstall Py...
    Link Info: D:\Python22\UNWISE.EXE | D:\PYTHON22 | | 0 | 1 | 0 |
    Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Python ...
    Link Info: D:\Python22\python.exe | D:\PYTHON22 | D:\PYTHON22...

You can see that each action recorded belongs to one of several types. A processing application would presumably handle each type of action differently (especially since each action has different data fields associated with it). It is easy enough to write Boolean functions that identify line types, for example:

    def isFileCopy(line):
        return line[:10]=='File Copy:'    # or line.startswith(...)
    def isFileOverwrite(line):
        return line[:15]=='File Overwrite:'

The string method "".startswith() is less error prone than an initial slice for recent Python versions, but these examples are compatible with Python 1.5. In a slightly more compact functional programming style, you can also write these like:

    isRegDBRoot = lambda line: line[:11]=='RegDB Root:'
    isRegDBKey = lambda line: line[:10]=='RegDB Key:'
    isRegDBVal = lambda line: line[:10]=='RegDB Val:'

Selecting lines of a certain type is done exactly as above:

    lines = open(r'd:\python22\install.log').readlines()
    regroot_lines = filter(isRegDBRoot, lines)

But if you want to select upon multiple criteria, an FP style can initially become cumbersome. For example, suppose you are interested in all the "RegDB" lines; you could write a new custom function for this filter:

    def isAnyRegDB(line):
        if   line[:11]=='RegDB Root:': return 1
        elif line[:10]=='RegDB Key:':  return 1
        elif line[:10]=='RegDB Val:':  return 1
        else:                          return 0
    # For recent Pythons, line.startswith(...) is better

Programming a custom function for each combined condition can produce a glut of named functions. More important, each such custom function requires a modicum of work to write and has a nonzero chance of introducing a bug. For conditions that should be jointly satisfied, you can either write custom functions or nest several filters within each other. For example:

    shortline = lambda line: len(line) < 25
    short_regvals = filter(shortline, filter(isRegDBVal, lines))

In this example, we rely on previously defined functions for the filter.
Any error in the filters will be in either shortline() or isRegDBVal(), but not independently in some third function isShortRegVal(). Such nested filters, however, are difficult to read, especially if more than two are involved.

Calls to map() are sometimes similarly nested if several operations are to be performed on the same string. For a fairly trivial example, suppose you wished to reverse, capitalize, and normalize whitespace in lines of text. Creating the support functions is straightforward, and they could be nested in map() calls:

    from string import upper, join, split
    def flip(s):
        a = list(s)
        a.reverse()
        return join(a,'')
    normalize = lambda s: join(split(s),' ')
    cap_flip_norms = map(upper, map(flip, map(normalize, lines)))

This type of map() or filter() nest is difficult to read, and should be avoided. Moreover, one can sometimes be drawn into nesting alternating map() and filter() calls, making matters still worse. For example, suppose you want to perform several operations on each of the lines that meet several criteria. To avoid this trap, many programmers fall back to a more verbose imperative coding style that simply wraps the lists in a few loops and creates some temporary variables for intermediate results.

Within a functional programming style, it is nonetheless possible to avoid the pitfall of excessive call nesting. The key to doing this is an intelligent selection of a few combinatorial higher-order functions. In general, a higher-order function is one that takes as argument or returns as result a function object. First-order functions just take some data as arguments and produce a datum as an answer (perhaps a data structure like a list or dictionary). In contrast, the "inputs" and "outputs" of a HOF are more function objects, ones generally intended to be eventually called somewhere later in the program flow.

One example of a higher-order function is a function factory: a function (or class) that returns a function, or collection of functions, that are somehow "configured" at the time of their creation.
The "Hello World" of function factories is an "adder" factory. Like "Hello World," an adder factory exists just to show what can be done; it doesn't really do anything useful by itself. Pretty much every explanation of function factories uses an example such as:

    >>> def adder_factory(n):
    ...     return lambda m, n=n: m+n
    ...
    >>> add10 = adder_factory(10)
    >>> add10
    <function <lambda> at 0x00FB0020>
    >>> add10(4)
    14
    >>> add10(20)
    30
    >>> add5 = adder_factory(5)
    >>> add5(4)
    9

For text processing tasks, simple function factories are of less interest than are combinatorial HOFs. The idea of a combinatorial higher-order function is to take several (usually first-order) functions as arguments and return a new function that somehow synthesizes the operations of the argument functions. Below is a simple library of combinatorial higher-order functions that achieve surprisingly much in a small number of lines:

combinatorial.py

    from operator import mul, add, truth
    apply_each = lambda fns, args=[]: map(apply, fns, [args]*len(fns))
    bools = lambda lst: map(truth, lst)
    bool_each = lambda fns, args=[]: bools(apply_each(fns, args))
    conjoin = lambda fns, args=[]: reduce(mul, bool_each(fns, args))
    all = lambda fns: lambda arg, fns=fns: conjoin(fns, (arg,))
    both = lambda f,g: all((f,g))
    all3 = lambda f,g,h: all((f,g,h))
    and_ = lambda f,g: lambda x, f=f, g=g: f(x) and g(x)
    disjoin = lambda fns, args=[]: reduce(add, bool_each(fns, args))
    some = lambda fns: lambda arg, fns=fns: disjoin(fns, (arg,))

Even with just over a dozen lines, many of these combinatorial functions are merely convenience functions that wrap other more general ones. Let us take a look at how we can use these HOFs to simplify some of the earlier examples.
The same names are used for results, so look above for comparisons:

Some examples using higher-order functions

    # Don't nest filters, just produce func that does both
    short_regvals = filter(both(shortline, isRegDBVal), lines)

    # Don't multiply ad hoc functions, just describe need
    regroot_lines = \
        filter(some([isRegDBRoot, isRegDBKey, isRegDBVal]), lines)

    # Don't nest transformations, make one combined transform
    capFlipNorm = compose3(upper, flip, normalize)
    cap_flip_norms = map(capFlipNorm, lines)

In the example, we bind the composed function capFlipNorm for readability. The corresponding map() line expresses just the single thought of applying a common operation to all the lines. But the binding also illustrates some of the flexibility of combinatorial functions. By condensing the several operations previously nested in several map() calls, we can save the combined operation for reuse elsewhere in the program.

As a rule of thumb, I recommend not using more than one filter() and one map() in any given line of code. If these "list application" functions need to nest more deeply than this, readability is preserved by saving results to intermediate names. Successive lines of such functional programming style calls themselves revert to a more imperative style, but a wonderful thing about Python is the degree to which it allows seamless combinations of different programming styles. For example:

    intermed = filter(niceProperty, map(someTransform, lines))
    final = map(otherTransform, intermed)

Any nesting of successive filter() or map() calls, however, can be reduced to single functions using the proper combinatorial HOFs. Therefore, the number of procedural steps needed is pretty much always quite small. However, the reduction in total lines-of-code is offset by the lines used for giving names to combinatorial functions. Overall, FP style code is usually about one-half the length of imperative style equivalents (fewer lines generally mean correspondingly fewer bugs).
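For readers on a current interpreter, the same combinatorial idea can be sketched in Python 3. This is not the book's combinatorial.py: the names all_of, some_of, compose, and starts_with are hypothetical, chosen for this illustration, and the sample lines are a cut-down stand-in for the INSTALL.LOG data. The sketch assumes Python 3, where filter() and map() return iterators (hence the list() calls) and reduce lives in functools.

```python
from functools import reduce

def all_of(*preds):
    """Combine predicates; the result is true only if every one is true."""
    # A list (not a generator) forces every predicate to run, so this
    # behaves like the non-shortcutting both()/all() described above.
    return lambda x: all([p(x) for p in preds])

def some_of(*preds):
    """Combine predicates; the result is true if at least one is true."""
    return lambda x: any([p(x) for p in preds])

def compose(*fns):
    """compose(f, g, h)(x) computes f(g(h(x))) for any number of functions."""
    return lambda x: reduce(lambda acc, fn: fn(acc), reversed(fns), x)

def starts_with(prefix):
    """Function factory: build a line-type predicate for one prefix."""
    return lambda line: line.startswith(prefix)

is_regdb_root = starts_with("RegDB Root:")
is_regdb_key = starts_with("RegDB Key:")
is_regdb_val = starts_with("RegDB Val:")
short_line = lambda line: len(line) < 25

flip = lambda s: s[::-1]                    # reverse a string
normalize = lambda s: " ".join(s.split())   # collapse runs of whitespace

# Hypothetical sample data in the style of the INSTALL.LOG excerpt
lines = [
    "Made Dir: D:\\Python22",
    "RegDB Root: 2",
    "RegDB Val: Python 2.2",
    "RegDB Val: D:\\PYTHON22\\Python.exe",
]

# One filter() per line of code, with the combined predicate built first
short_regvals = list(filter(all_of(short_line, is_regdb_val), lines))
regdb_lines = list(filter(some_of(is_regdb_root, is_regdb_key, is_regdb_val), lines))

# One map() with a single combined transformation
cap_flip_norm = compose(str.upper, flip, normalize)
cap_flip_norms = list(map(cap_flip_norm, lines))
```

Because compose() accepts any number of functions, the combined transform can be bound once and reused elsewhere, which is the same benefit the text attributes to binding capFlipNorm.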
A nice feature of combinatorial functions is that they can provide a complete Boolean algebra for functions that have not been called yet (the use of operator.add and operator.mul in combinatorial.py is more than accidental, in that sense). For example, with a collection of simple values, you might express a (complex) relation of multiple truth values as:

    satisfied = (this or that) and (foo or bar)

In the case of text processing on chunks of text, these truth values are often the results of predicative functions applied to a chunk:

    satisfied = (thisP(s) or thatP(s)) and (fooP(s) or barP(s))

In an expression like the above one, several predicate functions are applied to the same string (or other object), and a set of logical relations on the results is evaluated. But this expression is itself a logical predicate of the string. For naming clarity, and especially if you wish to evaluate the same predicate more than once, it is convenient to create an actual function expressing the predicate:

    satisfiedP = both(either(thisP,thatP), either(fooP,barP))

Using a predicative function created with combinatorial techniques is the same as using any other function:

    selected = filter(satisfiedP, lines)

1.1.2 Exercise: More on combinatorial functions

The module combinatorial.py presented above provides some of the most commonly useful combinatorial higher-order functions. But there is room for enhancement in the brief example. Creating a personal or organization library of useful HOFs is a way to improve the reusability of your current text processing libraries.

QUESTIONS

1: Some of the functions defined in combinatorial.py are not, strictly speaking, combinatorial. In a precise sense, a combinatorial function should take one or several functions as arguments and return one or more function objects that "combine" the input arguments. Identify which functions are not "strictly" combinatorial, and determine exactly what type of thing each one does return.

2: The functions both() and and_() do almost the same thing.
But they differ in an important, albeit subtle, way: and_(), like the Python operator and, uses shortcutting in its evaluation. Consider these lines:

    >>> f = lambda n: n**2 > 10
    >>> g = lambda n: 100/n > 10
    >>> and_(f,g)(5)
    1
    >>> both(f,g)(5)
    1
    >>> and_(f,g)(0)
    0
    >>> both(f,g)(0)
    Traceback (most recent call last):
    ...

The shortcutting and_() can potentially allow the first function to act as a "guard" for the second one. The second function never gets called if the first function returns a false value on a given argument.

    a. Create a similarly shortcutting combinatorial or_() function for your library.

    b. Create general shortcutting functions shortcut_all() and shortcut_some() that behave similarly to the functions all() and some(), respectively.

    c. Describe some situations where nonshortcutting combinatorial functions like both(), all(), or anyof3() are more desirable than similar shortcutting functions.

3: The function ident() would appear to be pointless, since it simply returns whatever value is passed to it. In truth, ident() is an almost indispensable function for a combinatorial collection. Explain the significance of ident(). Hint: Suppose you have a list of lines of text, where some of the lines may be empty strings. What filter can you apply to find all the lines that start with a #?

4: The function not_() might make a nice addition to a combinatorial library. We could define this function as:

    >>> not_ = lambda f: lambda x, f=f: not f(x)

Explore some situations where a not_() function would aid combinatoric programming.

5: The function apply_each() is used in combinatorial.py to build some other functions. But the utility of apply_each() is more general than its supporting role might suggest. A trivial usage of apply_each() might look something like:

    >>> apply_each(map(adder_factory, range(5)),(10,))
    [10, 11, 12, 13, 14]

Explore some situations where apply_each() simplifies applying multiple operations to a chunk of text.

6: Unlike the functions all() and some(), the functions compose() and compose3() take a fixed number of input functions as arguments.
Create a generalized composition function that takes a list of input functions, of any length, as an argument.

7: What other combinatorial higher-order functions that have not been discussed here are likely to prove useful in text processing? Consider other ways of combining first-order functions into useful operations, and add these to your library. What are good names for these enhanced HOFs?

1.1.3 Specializing Python Datatypes

Python comes with an excellent collection of standard datatypes; Appendix A discusses each built-in type. At the same time, an important principle of Python programming makes types less important than programmers coming from other languages tend to expect. According to Python's "principle of pervasive polymorphism" (my own coinage), it is more important what an object does than what it is. Another common way of putting the principle is: if it walks like a duck and quacks like a duck, treat it like a duck.

Broadly, the idea behind polymorphism is letting the same function or operator work on things of different types. In C++ or Java, for example, you might use signature-based method overloading to let an operation apply to several types of things (acting differently as needed). For example:

C++ signature-based polymorphism

    #include <stdio.h>
    class Print {
    public:
      void print(int i)    { printf("int %d\n", i); }
      void print(double d) { printf("double %f\n", d); }
      void print(float f)  { printf("float %f\n", f); }
    };
    int main() {
      Print *p = new Print();
      p->print(37);      /* --> "int 37" */
      p->print(37.0);    /* --> "double 37.000000" */
    }

The most direct Python translation of signature-based overloading is a function that performs type checks on its argument(s). It is simple to write such functions:

Python "signature-based" polymorphism

    def Print(x):
        from types import *
        if type(x) is FloatType:  print "float", x
        elif type(x) is IntType:  print "int", x
        elif type(x) is LongType: print "long", x

Writing signature-based functions, however, is extremely un-Pythonic.
If you find yourself performing these sorts of explicit type checks, you have probably not understood the problem you want to solve correctly! What you should (usually) be interested in is not what type x is, but rather whether x can perform the action you need it to perform (regardless of what type of thing it is strictly).

PYTHONIC POLYMORPHISM

Probably the single most common case where pervasive polymorphism is useful is in identifying "file-like" objects. There are many objects that can do things that files can do, such as those created with urllib, cStringIO, zipfile, and by other means. Various objects can perform only subsets of what actual files can: some can read, others can write, still others can seek, and so on. But for many purposes, you have no need to exercise every "file-like" capability; it is good enough to make sure that a specified object has those capabilities you actually need.

Here is a typical example. I have a module that uses DOM to work with XML documents; I would like users to be able to specify an XML source in any of several ways: using the name of an XML file, passing a file-like object that contains XML, or indicating an already built DOM object to work with (built with any of several XML libraries). Moreover, future users of my module may get their XML from novel places I have not even thought of (an RDBMS, over sockets, etc.). By looking at what a candidate object can do, I can just utilize whichever capabilities that object has:

Python capability-based polymorphism

    def toDOM(xml_src=None):
        from xml.dom import minidom
        from types import StringType, UnicodeType
        if hasattr(xml_src, 'documentElement'):
            return xml_src    # it is already a DOM object
        elif hasattr(xml_src, 'read'):
            # it is something that knows how to read data
            return minidom.parseString(xml_src.read())
        elif type(xml_src) in (StringType, UnicodeType):
            # it is a filename of an XML document
            xml = open(xml_src).read()
            return minidom.parseString(xml)
        else:
            raise ValueError, "Must be initialized with " +\
                  "filename, file-like object, or DOM object"

Even simple-seeming numeric types have varying capabilities.
As with other objects, you should not usually care about the internal representation of an object, but rather about what it can do. Of course, as one way to assure that an object has a capability, it is often appropriate to coerce it to a type using the built-in functions complex(), dict(), float(), int(), list(), long(), str(), tuple(), and unicode(). All of these functions make a good effort to transform anything that looks a little bit like the type of thing they name into a true instance of it. It is usually not necessary, however, actually to transform values to prescribed types; again we can just check capabilities.

For example, suppose that you want to remove the "least significant" portion of any number, perhaps because they represent measurements of limited accuracy. For whole numbers (ints or longs) you might mask out some low-order bits; for fractional values you might round to a given precision. Rather than testing value types explicitly, you can look for numeric capabilities. One common way to test a capability in Python is to try to do something, and catch any exceptions that occur (then try something else). Below is a simple example:

Checking what numbers can do

    def approx(x):                  # int attributes require 2.2+
        if hasattr(x,'__and__'):    # supports bitwise-and
            return x & ~0x0FL
        try:                        # supports real/imag
            return (round(x.real,2)+round(x.imag,2)*1j)
        except AttributeError:
            return round(x,2)

ENHANCED OBJECTS

The reason that the principle of pervasive polymorphism matters is that Python makes it easy to create new objects that behave mostly (but not exactly) like basic datatypes. File-like objects were already mentioned as examples; you may or may not think of a file object as a datatype precisely. But even basic datatypes like numbers, strings, lists, and dictionaries can be easily specialized and/or emulated.

There are two details to pay attention to when emulating basic datatypes.
The most important matter to understand is that the capabilities of an object, even those utilized with syntactic constructs, are generally implemented by its "magic" methods, each named with leading and trailing double underscores. Any object that has the right magic methods can act like a basic datatype in those contexts that use the supplied methods. At heart, a basic datatype is just an object with some well-optimized versions of the right collection of magic methods.

The second detail concerns exactly how you get at the magic methods, or rather, how best to make use of existing implementations. There is nothing stopping you from writing your own version of any basic datatype, except for the piddling details of doing so. However, there are quite a few such details, and the easiest way to get the functionality you want is to specialize an existing class. Under all non-ancient versions of Python, the standard library provides the pure-Python modules UserDict, UserList, and UserString as starting points for custom datatypes. You can inherit from an appropriate parent class and specialize (magic) methods as needed. No sample parents are provided for tuples, ints, floats, and the rest, however.

Under Python 2.2 and above, a better option is available. "New-style" Python classes let you inherit from the underlying C implementations of all the Python basic datatypes. Moreover, these parent classes have become the self-same callable objects that are used to coerce types and construct objects: int(), list(), unicode(), and so on. There is a lot of arcana and subtle profundities that accompany new-style classes, but you generally don't need to worry about these. All you need to know is that a class that inherits from str is faster than one that inherits from UserString; likewise for list versus UserList and dict versus UserDict (assuming your scripts all run on a recent enough version of Python).

Custom datatypes, however, need not specialize full-fledged implementations.
You are free to create classes that implement "just enough" of the interface of a basic datatype to be used for a given purpose. Of course, in practice, the reason you would create such custom datatypes is either because you want them to contain non-magic methods of their own or because you want them to implement the magic methods associated with multiple basic datatypes. For example, below is a custom datatype that can be passed to the prior approx() function, and that also provides a (slightly) useful custom method:

    >>> class I:  # "Fuzzy" integer datatype
    ...     def __init__(self, i):  self.i = i
    ...     def __and__(self, i):   return self.i & i
    ...     def err_range(self):
    ...         lbound = approx(self.i)
    ...         return "Value: [%d, %d)" % (lbound, lbound+0x0F)
    ...
    >>> i1, i2 = I(29), I(20)
    >>> approx(i1), approx(i2)
    (16L, 16L)
    >>> i2.err_range()
    'Value: [16, 31)'

Despite supporting an extra method and being able to get passed into the approx() function, I is not a very versatile datatype. If you try to add, or divide, or multiply using "fuzzy integers," you will raise a TypeError. Since there is no module called UserInt, under an older Python version you would need to implement every needed magic method yourself.

Using new-style classes in Python 2.2+, you could derive a "fuzzy integer" from the underlying int datatype. A partial implementation could look like:

    >>> class I2(int):    # New-style fuzzy integer
    ...     def __add__(self, j):
    ...         vals = map(int, [approx(self), approx(j)])
    ...         k = int.__add__(*vals)
    ...         return I2(int.__add__(k, 0x0F))
    ...     def err_range(self):
    ...         lbound = approx(self)
    ...         return "Value: [%d, %d)" % (lbound, lbound+0x0F)
    ...
    >>> i1, i2 = I2(29), I2(20)
    >>> print "i1 =", i1.err_range(), ": i2 =", i2.err_range()
    i1 = Value: [16, 31) : i2 = Value: [16, 31)
    >>> i3 = i1 + i2
    >>> print i3, type(i3)
    47 <class '__main__.I2'>

Since the new-style class int already supports bitwise-and, there is no need to implement it again. With new-style classes, you refer to data values directly with self, rather than as an attribute that holds the data (e.g., self.i in class I).
As well, it is generally unsafe to use syntactic operators within magic methods that define their operation; for example, I utilize the .__add__() method of the parent int rather than the + operator in the I2.__add__() method.

In practice, you are less likely to want to create number-like datatypes than you are to emulate container types. But it is worth understanding just how and why even plain integers are a fuzzy concept in Python (the fuzziness of the concepts is of a different sort than the fuzziness of I2 integers, though). Even a function that operates on whole numbers need not operate on objects of IntType or LongType, just on an object that satisfies the desired protocols.

1.1.4 Base Classes for Datatypes

There are several magic methods that are often useful to define for any custom datatype. In fact, these methods are useful even for classes that do not really define datatypes (in some sense, every object is a datatype since it can contain attribute values, but not every object supports special syntax such as arithmetic operators and indexing). Not quite every magic method that you can define is documented in this book, but most are under the parent datatype each is most relevant to. Moreover, each new version of Python has introduced a few additional magic methods; those covered either have been around for a few versions or are particularly important.

In documenting class methods of base classes, the same general conventions are used as for documenting module functions. The one special convention for these base class methods is the use of self as the first argument to all methods. Since the name self is purely arbitrary, this convention is less special than it might appear. For example, both of the following uses of self are equally legal:

    >>> import string
    >>> self = 'spam'
    >>> object.__repr__(self)
    '<str object at 0x12c0a0>'
    >>> string.upper(self)
    'SPAM'

However, there is usually little reason to use class methods in place of perfectly good built-in and module functions with the same purpose.
Normally, these methods of datatype classes are used only in child classes that override the base classes, as in:

    >>> class UpperObject(object):
    ...     def __repr__(self):
    ...         return object.__repr__(self).upper()
    ...
    >>> uo = UpperObject()
    >>> print uo
    <__MAIN__.UPPEROBJECT OBJECT AT 0X1C2C6C>

object • Ancestor class for new-style datatypes

Under Python 2.2+, object has become a base for new-style classes. Inheriting from object enables a custom class to use a few new capabilities, such as slots and properties. But usually if you are interested in creating a custom datatype, it is better to inherit from a child of object, such as list, float, or dict.

METHODS

object.__eq__(self, other)
Return a Boolean comparison between self and other. Determines how a datatype responds to the == operator. The parent class object does not implement .__eq__() since by default object equality means the same thing as identity (the is operator). A child is free to implement this in order to affect comparisons.

object.__ne__(self, other)
Return a Boolean comparison between self and other. Determines how a datatype responds to the != and <> operators. The parent class object does not implement .__ne__() since by default object inequality means the same thing as nonidentity (the is not operator). Although it might seem that equality and inequality always return opposite values, the methods are not explicitly defined in terms of each other. You could force the relationship with:

    >>> class EQ(object):
    ...     # Abstract parent class for equality classes
    ...     def __eq__(self, o): return not self <> o
    ...     def __ne__(self, o): return not self == o
    ...
    >>> class Comparable(EQ):
    ...     # By defining inequality, get equality (or vice versa)
    ...     def __ne__(self, other):
    ...         return someComplexComparison(self, other)

object.__nonzero__(self)
Return a Boolean value for an object. Determines how a datatype responds to the Boolean comparisons or, and, and not, and to if and filter(None,...) tests.
An object whose .__nonzero__() method returns a true value is itself treated as a true value.

object.__len__(self)
len(object)
Return an integer representing the "length" of the object. For collection types, this is fairly straightforward: how many objects are in the collection? Custom types may change the behavior to some other meaningful value.

object.__repr__(self)
repr(object)
object.__str__(self)
str(object)
Return a string representation of the object self. Determines how a datatype responds to the repr() and str() built-in functions, to the print keyword, and to the back-tick operator.

Where feasible, it is desirable to have the .__repr__() method return a representation with sufficient information in it to reconstruct an identical object. The goal here is to fulfill the equality obj==eval(repr(obj)). In many cases, however, you cannot encode sufficient information in a string, and the repr() of an object is either identical to, or slightly more detailed than, the str() representation of the same object.

SEE ALSO: repr; operator

file • New-style base class for file objects

Under Python 2.2+, it is possible to create a custom file-like object by inheriting from the built-in class file. In older Python versions you may only create file-like objects by defining the methods that define an object as "file-like." However, even in recent versions of Python, inheritance from file buys you little: if the data contents come from somewhere other than a native filesystem, you will have to reimplement every method you wish to support.

Even more than for other object types, what makes an object file-like is a fuzzy concept. Depending on your purpose you may be happy with an object that can only read, or one that can only write. You may need to seek within the object, or you may be happy with a linear stream. In general, however, file-like objects are expected to read and write strings. Custom classes only need implement those methods that are meaningful to them and should only be used in contexts where their capabilities are sufficient.
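To make the idea of implementing only the meaningful methods concrete, here is a minimal sketch of a read-only file-like object, written in Python 3 syntax (an assumption; the text targets Python 2.2). The class name UpperReader and the helper count_lines() are hypothetical. The point is only that a consumer calling .readline() cannot tell such an object from a real file.

```python
class UpperReader:
    """Read-only, file-like view over a string that upper-cases its lines.

    Only the methods a line-oriented consumer needs are implemented;
    there is no write(), seek(), or fileno().
    """

    def __init__(self, text):
        self._lines = text.splitlines(True)   # keep trailing newlines
        self._next = 0
        self.closed = False

    def readline(self):
        if self.closed:
            raise ValueError("I/O operation on closed file")
        if self._next >= len(self._lines):
            return ""                          # empty string signals EOF
        line = self._lines[self._next]
        self._next += 1
        return line.upper()

    def read(self):
        # Collect every remaining line until readline() reports EOF.
        return "".join(iter(self.readline, ""))

    def close(self):
        self.closed = True


def count_lines(fh):
    """Works with any object offering readline(), real file or not."""
    n = 0
    while fh.readline():
        n += 1
    return n


reader = UpperReader("spam and eggs\nthis and that\n")
first = reader.readline()
rest = reader.read()
```

A function such as count_lines() that needs random access would have to demand .seek() and .tell() as well, which is exactly the sense in which a custom class should only be used where its capabilities are sufficient.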
In documenting the methods of file-like objects, I adopt a slightly different convention than for other built-in types. Since actually inheriting from file is unusual, I use the capitalized name FILE to indicate a general file-like object. Instances of the actual file class are examples (and implement all the methods named), but other types of objects can be equally good FILE instances.

BUILT-IN FUNCTIONS

open(fname [,mode [,buffering]])
file(fname [,mode [,buffering]])
Return a file object that attaches to the filename fname. The optional argument mode describes the capabilities and access style of the object. An r mode is for reading; w for writing (truncating any existing content); a for appending (writing to the end). Each of these modes may also have the binary flag b for platforms like Windows that distinguish text and binary files. The flag + may be used to allow both reading and writing. The argument buffering may be 0 for none, 1 for line-oriented, a larger integer for number of bytes.

    >>> open('tmp','w').write('spam and eggs\n')
    >>> print open('tmp','r').read(),
    spam and eggs
    >>> open('tmp','w').write('this and that\n')
    >>> print open('tmp','r').read(),
    this and that
    >>> open('tmp','a').write('something else\n')
    >>> print open('tmp','r').read(),
    this and that
    something else

METHODS AND ATTRIBUTES

FILE.close()
Close a file object. Reading and writing are disallowed after a file is closed.

FILE.closed
Return a Boolean value indicating whether the file has been closed.

FILE.fileno()
Return a file descriptor number for the file. File-like objects that do not attach to actual files should not implement this method.

FILE.flush()
Write any pending data to the underlying file. File-like objects that do not cache data can still implement this method as pass.

FILE.isatty()
Return a Boolean value indicating whether the file is a TTY-like device. The standard documentation says that file-like objects that do not attach to actual files should not implement this method, but implementing it to always return 0 is probably a better approach.
FILE.mode
Attribute containing the mode of the file, normally identical to the mode argument passed to the object's initializer.

FILE.name
The name of the file. For file-like objects without a filesystem name, some string identifying the object should be put into this attribute.

FILE.read([size=sys.maxint])
Return a string containing up to size bytes of content from the file. Stop the read if an EOF is encountered or upon another condition that makes sense for the object type. Move the file position forward immediately past the bytes read. A negative size argument is treated as the default value.

FILE.readline([size=sys.maxint])
Return a string containing one line from the file, including the trailing newline, if any. A maximum of size bytes are read. The file position is moved forward past the read. A negative size argument is treated as the default value.

FILE.readlines([size=sys.maxint])
Return a list of lines from the file, each line including its trailing newline. If the argument size is given, limit the read to approximately size bytes worth of lines. The file position is moved forward past the bytes read. A negative size argument is treated as the default value.

FILE.seek(offset [,whence=0])
Move the file position by offset bytes (positive or negative). The argument whence specifies where the initial file position is prior to the move: 0 for BOF; 1 for current position; 2 for EOF.

FILE.tell()
Return the current file position.

FILE.truncate([size])
Truncate the file contents (it becomes size length). For the standard file type, size defaults to the current file position.

FILE.write(s)
Write the string s to the file, starting at the current file position. The file position is moved forward past the written bytes.

FILE.writelines(lines)
Write the lines in the sequence lines to the file. No newlines are added during the write. The file position is moved forward past the written bytes.

FILE.xreadlines()
Memory-efficient iterator over lines in a file. In Python 2.2+, you might implement this as a generator that returns one line per each yield.
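The interplay of file position, .seek(), .tell(), and .truncate() is easier to see in a short session. The sketch below uses Python 3 and the standard library's io.BytesIO, an in-memory file-like object, as a stand-in for FILE; that choice is an assumption made for the sake of a self-contained example, and a real file opened in binary mode should behave the same way.

```python
import io

buf = io.BytesIO(b"spam and eggs\nthis and that\n")

first = buf.readline()          # reads through the first newline
pos_after_line = buf.tell()     # position is now just past that line

buf.seek(0)                     # whence=0: offset from beginning of file
head = buf.read(4)

buf.seek(-5, 2)                 # whence=2: offset relative to end of file
tail = buf.read()

buf.truncate(4)                 # keep only the first four bytes
remaining = buf.getvalue()
```

Note that every read moves the position forward, which is why the explicit seeks are needed before rereading earlier content.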
SEE ALSO: xreadlines

int • New-style base class for integer objects
long • New-style base class for long integers

In Python, there are two standard datatypes for representing integers. Objects of type IntType have a fixed range that depends on the underlying platform, usually between plus and minus 2**31. Objects of type LongType are unbounded in size. In Python 2.2+, an operation on integers that exceeds the range of an int object results in automatic promotion to long objects. However, no operation on a long will demote the result back to an int object (even if the result is of small magnitude), with the exception of the int() function, of course.

From a user point of view ints and longs provide exactly the same interface. The difference between them is only in underlying implementation, with ints typically being significantly faster to operate on (since they use raw CPU instructions fairly directly). Most of the magic methods integers have are shared by floating point numbers as well and are discussed below. For example, consult the discussion of float.__mul__() for information on the corresponding int.__mul__() method. The special capability that integers have over floating point numbers is their ability to perform bitwise operations.

Under Python 2.2+, you may create a custom datatype that inherits from int or long; under earlier versions, you would need to manually define all the magic methods you wished to utilize (generally a lot of work, and probably not worth it).

Each binary bit operation has a left-associative and a right-associative version. If you define both versions and perform an operation on two custom objects, the left-associative version is chosen. However, if you perform an operation with a basic int and a custom object, the custom right-associative method will be chosen over the basic operation.
For example:

    >>> class I(int):
    ...     def __xor__(self, other):
    ...         return "XOR"
    ...     def __rxor__(self, other):
    ...         return "RXOR"
    ...
    >>> 0xFF ^ 0xFF
    0
    >>> 0xFF ^ I(0xFF)
    'RXOR'
    >>> I(0xFF) ^ 0xFF
    'XOR'
    >>> I(0xFF) ^ I(0xFF)
    'XOR'

METHODS

int.__and__(self, other)
int.__rand__(self, other)
Return a bitwise-and between self and other. Determines how a datatype responds to the & operator.

int.__hex__(self)
Return a hex string representing self. Determines how a datatype responds to the built-in hex() function.

int.__invert__(self)
Return a bitwise inversion of self. Determines how a datatype responds to the ~ operator.

int.__lshift__(self, other)
int.__rlshift__(self, other)
Return the result of bit-shifting self to the left by other bits. The right-associative version shifts other by self bits. Determines how a datatype responds to the << operator.

int.__oct__(self)
Return an octal string representing self. Determines how a datatype responds to the built-in oct() function.

int.__or__(self, other)
int.__ror__(self, other)
Return a bitwise-or between self and other. Determines how a datatype responds to the | operator.

int.__rshift__(self, other)
int.__rrshift__(self, other)
Return the result of bit-shifting self to the right by other bits. The right-associative version shifts other by self bits. Determines how a datatype responds to the >> operator.

int.__xor__(self, other)
int.__rxor__(self, other)
Return a bitwise-xor between self and other. Determines how a datatype responds to the ^ operator.

SEE ALSO: float; int; long; sys.maxint; operator

float • New-style base class for floating point numbers

Python floating point numbers are mostly implemented using the underlying C floating point library of your platform; that is, to a greater or lesser degree based on the IEEE 754 standard. A complex number is just a Python object that wraps a pair of floats with a few extra operations on these pairs.

DIGRESSION

Although the details are far outside the scope of this book, a general warning is in order. Floating point math is harder than you think!
If you think you understand just how complex IEEE 754 math is, you are not yet aware of all of the subtleties. By way of indication, Python luminary and erstwhile professor of numeric computing Alex Martelli commented in 2001 (on [...]):

    Anybody who thinks he knows what he's doing when floating
    point is involved IS either naive, or Tim Peters (well, it
    COULD be W. Kahan I guess, but I don't think he writes here).

Fellow Python guru Tim Peters observed:

    I find it's possible to be both (wink). But nothing about fp
    comes easily to anyone, and even Kahan works his butt off to
    come up with the amazing things that he does.

Peters illustrated further by way of Donald Knuth (The Art of Computer Programming, Third Edition, Addison-Wesley, 1997; ISBN: 0201896842, vol. 2, p. 223):

    Many serious mathematicians have attempted to analyze a
    sequence of floating point operations rigorously, but found
    the task so formidable that they have tried to be content
    with plausibility arguments instead.

The trick about floating point numbers is that although they are extremely useful for representing real-life (fractional) quantities, operations on them do not obey the arithmetic rules we learned in middle school: associativity, transitivity, commutativity; moreover, many very ordinary-seeming numbers can be represented only approximately with floating point numbers. For example:

    >>> 1./3
    0.33333333333333331
    >>> .3
    0.29999999999999999
    >>> 7 == 7./25 * 25
    0
    >>> 7 == 7./24 * 24
    1

CAPABILITIES

In the hierarchy of Python numeric types, floating point numbers are higher up the scale than integers, and complex numbers higher than floats. That is, operations on mixed types get promoted upwards. However, the magic methods that make a datatype "float-like" are strictly a subset of those associated with integers. All of the magic methods listed below for floats apply equally to ints and longs (or integer-like custom datatypes). Complex numbers support a few additional methods.
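Both points, the failure of ordinary arithmetic rules and the upward promotion of mixed types, can be checked directly. The following is a small sketch in Python 3 (an assumption; the sessions in the text are Python 2), using math.isclose() from the standard library as one reasonable way to compare floats with a tolerance.

```python
import math

# Associativity fails: grouping changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
same_bits = (left == right)               # exact comparison
close_enough = math.isclose(left, right)  # comparison with a tolerance

# Mixed-type operations promote upward: int -> float -> complex.
int_plus_float = type(1 + 1.0).__name__
float_plus_complex = type(1.0 + 1j).__name__

print(same_bits, close_enough, int_plus_float, float_plus_complex)
```

Comparing with a tolerance does not make floating point math simple, but it avoids the most common surprise, namely exact equality tests on values that were computed along different paths.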
Under Python 2.2+, you may create a custom datatype that inherits from float or complex; under earlier versions, you would need to manually define all the magic methods you wished to utilize (generally a lot of work, and probably not worth it).

Each binary operation has a left-associative and a right-associative version. If you define both versions and perform an operation on two custom objects, the left-associative version is chosen. However, if you perform an operation with a basic datatype and a custom object, the custom right-associative method will be chosen over the basic operation. See the example under int.

METHODS

float.__abs__(self)
Return the absolute value of self. Determines how a datatype responds to the built-in function abs().

float.__add__(self, other)
float.__radd__(self, other)
Return the sum of self and other. Determines how a datatype responds to the + operator.

float.__cmp__(self, other)
Return a value indicating the order of self and other. Determines how a datatype responds to the numeric comparison operators <, >, <=, >=, ==, <>, and !=. Also determines the behavior of the built-in cmp() function. Should return -1 for self<other, 0 for self==other, and 1 for self>other.

[...]

    >>> lst.append(1j)
    >>> lst.sort()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: cannot compare complex numbers using <, <=, >, >=

It is true that there is no obvious correct ordering between a complex number and another number (complex or otherwise), but there is also no natural ordering between a string, a tuple, and a number. Nonetheless, it is frequently useful to sort a heterogeneous list in order to create a canonical (even if meaningless) order.
In Python 2.2+, you can remedy this shortcoming of recent Python versions in the style below (under 2.1 you are largely out of luck):

    >>> class C(complex):
    ...     def __lt__(self, o):
    ...         if hasattr(o, 'imag'):
    ...             return (self.real,self.imag) < (o.real,o.imag)
    ...         else:
    ...             return self.real < o
    ...     def __le__(self, o): return self < o or self==o
    ...     def __gt__(self, o): return not (self==o or self < o)
    ...     def __ge__(self, o): return self > o or self==o
    ...
    >>> lst = ["str", 1.0, 1, 1L, (1,2,3), C(1+1j), C(2-2j)]
    >>> lst.sort()
    >>> lst
    [1.0, 1, 1L, (1+1j), (2-2j), 'str', (1, 2, 3)]

Of course, if you adopt this strategy, you have to create all of your complex values using the custom datatype C. And unfortunately, unless you override arithmetic operations also, a binary operation between a C object and another number reverts to a basic complex datatype. The reader can work out the details of this solution if she needs it.

METHODS

complex.conjugate(self)
Return the complex conjugate of self. A quick refresher here: if self is n+mj, its conjugate is n-mj.

complex.imag
Imaginary component of a complex number.

complex.real
Real component of a complex number.

SEE ALSO: float; complex

UserDict • Custom wrapper around dictionary objects
dict • New-style base class for dictionary objects

Dictionaries in Python provide a well-optimized mapping between immutable objects and other Python objects (see Glossary entry on "immutable"). You may create custom datatypes that respond to various dictionary operations. There are a few syntactic operations associated with dictionaries, all involving indexing with square braces. But unlike with numeric datatypes, there are several regular methods that are reasonable to consider as part of the general interface for dictionary-like objects.

If you create a dictionary-like datatype by subclassing from UserDict.UserDict, all the special methods defined by the parent are proxies to the true dictionary stored in the object's .data member. If, under Python 2.2+, you subclass from dict itself, the object itself inherits dictionary behaviors.
In either case, you may customize whichever methods you wish. Below is an example of the two styles for subclassing a dictionary-like datatype:

    >>> from sys import stderr
    >>> from UserDict import UserDict
    >>> class LogDictOld(UserDict):
    ...     def __setitem__(self, key, val):
    ...         stderr.write("Set: "+str(key)+"->"+str(val)+"\n")
    ...         self.data[key] = val
    ...
    >>> ldo = LogDictOld()
    >>> ldo['this'] = 'that'
    Set: this->that
    >>> class LogDictNew(dict):
    ...     def __setitem__(self, key, val):
    ...         stderr.write("Set: "+str(key)+"->"+str(val)+"\n")
    ...         dict.__setitem__(self, key, val)
    ...
    >>> ldn = LogDictNew()
    >>> ldn['this'] = 'that'
    Set: this->that

METHODS

dict.__cmp__(self, other)
UserDict.UserDict.__cmp__(self, other)
Return a value indicating the order of self and other. Determines how a datatype responds to the numeric comparison operators <, >, <=, >=, ==, <>, and !=. Also determines the behavior of the built-in cmp() function. Should return -1 for self<other, 0 for self==other, and 1 for self>other.

[...]

    >>> class BagOfPairs(dict):
    ...     def __getitem__(self, x):
    ...         if self.has_key(x):
    ...             return (x, dict.__getitem__(self,x))
    ...         else:
    ...             tmp = dict([(v,k) for k,v in self.items()])
    ...             return (dict.__getitem__(tmp,x), x)
    ...
    >>> bop = BagOfPairs({'this':'that', 'spam':'eggs'})
    >>> bop['this']
    ('this', 'that')
    >>> bop['eggs']
    ('spam', 'eggs')
    >>> bop['bacon'] = 'sausage'
    >>> bop
    {'this': 'that', 'bacon': 'sausage', 'spam': 'eggs'}
    >>> bop['nowhere']
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 7, in __getitem__
    KeyError: nowhere

dict.__len__(self)
UserDict.UserDict.__len__(self)
Return the length of the dictionary. By default this is simply a count of the key/val pairs, but you could perform a different calculation if you wished (e.g., perhaps you would cache the size of a record set returned from a database query that emulated a dictionary). Determines how a datatype responds to the built-in len() function.

dict.__setitem__(self, key, val)
UserDict.UserDict.__setitem__(self, key, val)
Set the dictionary key key to value val. Determines how a datatype responds to indexed assignment; that is, self[key]=val. A custom version might actually perform some calculation based on val and/or key before adding an item.
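As an illustration of a .__setitem__() that computes on the key before storing, here is a minimal sketch in Python 3 syntax (an assumption; the surrounding text is Python 2). The class name LowerKeyDict and its helper _norm() are hypothetical and not part of any library.

```python
class LowerKeyDict(dict):
    """dict that lower-cases string keys on assignment and lookup."""

    @staticmethod
    def _norm(key):
        # Only strings are normalized; other hashable keys pass through.
        return key.lower() if isinstance(key, str) else key

    def __setitem__(self, key, val):
        dict.__setitem__(self, self._norm(key), val)

    def __getitem__(self, key):
        return dict.__getitem__(self, self._norm(key))

    def __contains__(self, key):
        return dict.__contains__(self, self._norm(key))


d = LowerKeyDict()
d["Spam"] = "eggs"
d[7] = "seven"
print(d["SPAM"], "sPaM" in d, list(d))
```

One caveat with this design: in CPython the dict constructor and .update() do not route through an overridden .__setitem__(), so a class like this would need to override those as well before it could be relied on.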
dict.clear(self)
UserDict.UserDict.clear(self)
Remove all items from self.

dict.copy(self)
UserDict.UserDict.copy(self)
Return a copy of the dictionary self (i.e., a distinct object with the same items).

dict.get(self, key [,default=None])
UserDict.UserDict.get(self, key [,default=None])
Return the value associated with the key key. If no item with the key exists, return default instead of raising a KeyError.

dict.has_key(self, key)
UserDict.UserDict.has_key(self, key)
Return a Boolean value indicating whether self has the key key.

dict.items(self)
UserDict.UserDict.items(self)
dict.iteritems(self)
UserDict.UserDict.iteritems(self)
Return the items in a dictionary, in an unspecified order. The .items() method returns a true list of (key,val) pairs, while the .iteritems() method (in Python 2.2+) returns a generator object that successively yields items. The latter method is useful if your dictionary is not a true in-memory structure, but rather some sort of incremental query or calculation. Either method responds externally similarly to a for loop:

    >>> d = {1:2, 3:4}
    >>> for k,v in d.iteritems(): print k,v,':',
    ...
    1 2 : 3 4 :
    >>> for k,v in d.items(): print k,v,':',
    ...
    1 2 : 3 4 :

dict.keys(self)
UserDict.UserDict.keys(self)
dict.iterkeys(self)
UserDict.UserDict.iterkeys(self)
Return the keys in a dictionary, in an unspecified order. The .keys() method returns a true list of keys, while the .iterkeys() method (in Python 2.2+) returns a generator object.

SEE ALSO: dict.items()

dict.popitem(self)
UserDict.UserDict.popitem(self)
Return a (key,val) pair for the dictionary, or raise a KeyError if the dictionary is empty. Removes the returned item from the dictionary. As with other dictionary methods, the order in which items are popped is unspecified (and can vary between versions and platforms).

dict.setdefault(self, key [,default=None])
UserDict.UserDict.setdefault(self, key [,default=None])
If key is currently in the dictionary, return the corresponding value. If key is not currently in the dictionary, set self[key]=default, then return default.
SEE ALSO: dict.get()

dict.update(self, other)
UserDict.UserDict.update(self, other)
Update the dictionary self using the dictionary other. If a key in other already exists in self, the corresponding value from other is used in self. If a (key,val) pair in other is not in self, it is added.

dict.values(self)
UserDict.UserDict.values(self)
dict.itervalues(self)
UserDict.UserDict.itervalues(self)
Return the values in a dictionary, in an unspecified order. The .values() method returns a true list of values, while the .itervalues() method (in Python 2.2+) returns a generator object.

SEE ALSO: dict.items()

SEE ALSO: dict; list; operator

UserList • Custom wrapper around list objects
list • New-style base class for list objects
tuple • New-style base class for tuple objects

A Python list is a (possibly) heterogeneous mutable sequence of Python objects. A tuple is a similar immutable sequence (see Glossary entry on "immutable"). Most of the magic methods of lists and tuples are the same, but a tuple does not have those methods associated with internal transformation.

If you create a list-like datatype by subclassing from UserList.UserList, all the special methods defined by the parent are proxies to the true list stored in the object's .data member. If, under Python 2.2+, you subclass from list (or tuple) itself, the object itself inherits list (tuple) behaviors. In either case, you may customize whichever methods you wish. The discussion of dict and UserDict shows an example of the different styles of specialization.

The difference between a list-like object and a tuple-like object runs less deep than you might think. Mutability is only really important for using objects as dictionary keys, but dictionaries only check the mutability of an object by examining the return value of an object's .__hash__() method. If this method fails to return an integer, an object is considered mutable (and ineligible to serve as a dictionary key).
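The role of .__hash__() in deciding key eligibility can be sketched in Python 3 (an assumption; the text targets Python 2.2). The two class names below are hypothetical. The first derives its hash from the current items, so the hash drifts when the list is mutated; the second derives it from object identity, so it stays fixed for the life of the object.

```python
class ContentHashList(list):
    # Hash derived from the current items: changes when the list mutates.
    def __hash__(self):
        return hash(tuple(self))


class IdentityHashList(list):
    # Hash derived from object identity: stable across mutation.
    def __hash__(self):
        return id(self)


# A plain list offers no usable hash, so it cannot be a dictionary key.
try:
    hash([1, 2, 3])
    plain_list_hashable = True
except TypeError:
    plain_list_hashable = False

drifting = ContentHashList([1, 2, 3])
hash_before = hash(drifting)
drifting.append(4)
hash_after = hash(drifting)      # now tracks (1, 2, 3, 4)

stable = IdentityHashList([1, 2, 3])
dct = {stable: 33, 7: 8}
stable.append(4)
found = dct[stable]              # still reachable after the mutation
```

A key whose hash drifts after insertion may become unreachable in the dictionary, which is why a content-based hash is a poor fit for a mutable container.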
The reason that tuples are useful as keys is because every tuple composed of the same items has the same hash; two lists (or [...]

    >>> print dct
    {[1, 2, 3]: 33, 7: 8}
    >>> dct[lst]
    33
    >>> lst.append(4)
    >>> print dct
    {[1, 2, 3, 4]: 33, 7: 8}
    >>> dct[lst]
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    KeyError: [1, 2, 3, 4]

As soon as lst changes, its hash changes, and you cannot reach the dictionary item keyed to it. What you need is something that does not change as the object changes:

    >>> class L(list):
    ...     __hash__ = lambda self: id(self)
    ...
    >>> lst = L([1,2,3])
    >>> dct = {lst:33, 7:8}
    >>> dct[lst]
    33
    >>> lst.append(4)
    >>> dct
    {[1, 2, 3, 4]: 33, 7: 8}
    >>> dct[lst]
    33

As with most everything about Python datatypes and operations, mutability is merely a protocol that you can choose to support or not support in your custom datatypes.

Sequence datatypes may choose to support order comparisons; in fact they probably should. The methods .__cmp__(), .__ge__(), .__gt__(), .__le__(), and .__lt__() have the same meanings for sequences that they do for other datatypes; see operator, float, and dict for details.

METHODS

list.__add__(self, other)
UserList.UserList.__add__(self, other)
tuple.__add__(self, other)
list.__iadd__(self, other)
UserList.UserList.__iadd__(self, other)
Determine how a datatype responds to the + and += operators. Augmented assignments ("in-place add") are supported in Python 2.0+. For list-like datatypes, normally the statements lst+=other and lst=lst+other have the same effect, but the augmented version might be more efficient.

Under standard meaning, addition of the two sequence objects produces a new (distinct) sequence object with all the items in both self and other. An in-place add (.__iadd__()) mutates the left-hand object without creating a new object. A custom datatype might choose to give a special meaning to addition, perhaps depending on the datatype of the object added in.
For example:

    >>> class XList(list):
    ...     def __iadd__(self, other):
    ...         if issubclass(other.__class__, list):
    ...             return list.__iadd__(self, other)
    ...         else:
    ...             from operator import add
    ...             return map(add, self, [other]*len(self))
    ...
    >>> xl = XList([1,2,3])
    >>> xl += [4,5,6]
    >>> xl
    [1, 2, 3, 4, 5, 6]
    >>> xl += 10
    >>> xl
    [11, 12, 13, 14, 15, 16]

list.__contains__(self, x)
UserList.UserList.__contains__(self, x)
tuple.__contains__(self, x)
Return a Boolean value indicating whether self contains the value x. Determines how a datatype responds to the in operator.

list.__delitem__(self, x)
UserList.UserList.__delitem__(self, x)
Remove an item from a list-like datatype. Determines how a datatype responds to the del statement, as in del self[x].

list.__delslice__(self, start, end)
UserList.UserList.__delslice__(self, start, end)
Remove a range of items from a list-like datatype. Determines how a datatype responds to the del statement applied to a slice, as in del self[start:end].

list.__getitem__(self, pos)
UserList.UserList.__getitem__(self, pos)
tuple.__getitem__(self, pos)
Return the value at offset pos in the list. Determines how a datatype responds to indexing with square braces. The default behavior on list indices is to raise an IndexError for nonexistent offsets.

list.__getslice__(self, start, end)
UserList.UserList.__getslice__(self, start, end)
tuple.__getslice__(self, start, end)
Return a subsequence of the sequence self. Determines how a datatype responds to indexing with a slice parameter, as in self[start:end].

list.__hash__(self)
UserList.UserList.__hash__(self)
tuple.__hash__(self)
Return an integer that distinctly identifies an object. Determines how a datatype responds to the built-in hash() function, and probably more importantly the hash is used internally in dictionaries. By default, tuples (and other immutable types) will return hash values but lists will raise a TypeError. Dictionaries will handle hash collisions gracefully, but it is best to try to make hashes unique per object.
    >>> hash(219750523), hash((1,2))
    (219750523, 219750523)
    >>> dct = {219750523:1, (1,2):2}
    >>> dct[219750523]
    1

list.__len__(self)
UserList.UserList.__len__(self)
tuple.__len__(self)
Return the length of a sequence. Determines how a datatype responds to the built-in len() function.

list.__mul__(self, num)
UserList.UserList.__mul__(self, num)
tuple.__mul__(self, num)
list.__rmul__(self, num)
UserList.UserList.__rmul__(self, num)
tuple.__rmul__(self, num)
list.__imul__(self, num)
UserList.UserList.__imul__(self, num)
Determine how a datatype responds to the * and *= operators. Augmented assignments ("in-place multiply") are supported in Python 2.0+. For list-like datatypes, normally the statements lst*=other and lst=lst*other have the same effect, but the augmented version might be more efficient. The right-associative version .__rmul__() determines the value of num*self, the left-associative .__mul__() determines the value of self*num. Under standard meaning, the product of a sequence and a number produces a new (distinct) sequence object with the items in self duplicated num times:

    >>> [1,2,3] * 3
    [1, 2, 3, 1, 2, 3, 1, 2, 3]

list.__setitem__(self, pos, val)
UserList.UserList.__setitem__(self, pos, val)
Set the value at offset pos to value val. Determines how a datatype responds to indexed assignment; that is, self[pos]=val. A custom version might actually perform some calculation based on val and/or pos before adding an item.

list.__setslice__(self, start, end, other)
UserList.UserList.__setslice__(self, start, end, other)
Replace the subsequence self[start:end] with the sequence other. The replaced and new sequences are not necessarily the same length, and the resulting sequence might be longer or shorter than self. Determines how a datatype responds to assignment to a slice, as in self[start:end]=other.

list.append(self, item)
UserList.UserList.append(self, item)
Add the object item to the end of the sequence self. Increases the length of self by one.
list.count(self, item)
UserList.UserList.count(self, item)
Return the integer number of occurrences of item in self.

list.extend(self, seq)
UserList.UserList.extend(self, seq)
Add each item in seq to the end of the sequence self. Increases the length of self by len(seq).

list.index(self, item)
UserList.UserList.index(self, item)
Return the offset index of the first occurrence of item in self.

list.insert(self, pos, item)
UserList.UserList.insert(self, pos, item)
Add the object item to the sequence self before the offset pos. Increases the length of self by one.

list.pop(self [,pos=-1])
UserList.UserList.pop(self [,pos=-1])
Return the item at offset pos of the sequence self, and remove the returned item from the sequence. By default, remove the last item, which lets a list act like a stack using the .pop() and .append() operations.

list.remove(self, item)
UserList.UserList.remove(self, item)
Remove the first occurrence of item in self. Decreases the length of self by one.

list.reverse(self)
UserList.UserList.reverse(self)
Reverse the list self in place.

list.sort(self [,cmpfunc])
UserList.UserList.sort(self [,cmpfunc])
Sort the list self in place. If a comparison function cmpfunc is given, perform comparisons using that function.

SEE ALSO: list; tuple; dict; operator

UserString • Custom wrapper around string objects
str • New-style base class for string objects

A string in Python is an immutable sequence of characters (see Glossary entry on "immutable"). There is special syntax for creating strings (single and triple quoting, character escaping, and so on), but in terms of object behaviors and magic methods, most of what a string does a tuple does, too. Both may be sliced and indexed, and both respond to pseudo-arithmetic operators + and *.

For the str and UserString magic methods that are strictly a matter of the sequence quality of strings, see the corresponding tuple documentation. These include str.__add__(), str.__getitem__(), str.__getslice__(), str.__hash__(), str.__len__(), str.__mul__(), and str.__rmul__().
Each of these methods is also defined in UserString. The UserString module also includes a few explicit definitions of magic methods that are not in the new-style str class: UserString.__iadd__(), UserString.__imul__(), and UserString.__radd__(). However, you may define your own implementations of these methods, even if you inherit from str (in Python 2.2+). In any case, internally, in-place operations are still performed on all strings.

Strings have quite a number of nonmagic methods as well. If you wish to create a custom datatype that can be utilized in the same functions that expect strings, you may want to specialize some of these common string methods. The behavior of string methods is documented in the discussion of the string module, even for the few string methods that are not also defined in the string module. However, inheriting from either str or UserString provides very reasonable default behaviors for all these methods.

SEE ALSO: "".capitalize(); "".title(); "".center(); "".count(); "".endswith(); "".expandtabs(); "".find(); "".index(); "".isalpha(); "".isalnum(); "".isdigit(); "".islower(); "".isspace(); "".istitle(); "".isupper(); "".join(); "".ljust(); "".lower(); "".lstrip(); "".replace(); "".rfind(); "".rindex(); "".rjust(); "".rstrip(); "".split(); "".splitlines(); "".startswith(); "".strip(); "".swapcase(); "".translate(); "".upper(); "".encode();

METHODS

str.__contains__(self, x)
UserString.UserString.__contains__(self, x)

Return a Boolean value indicating whether self contains the character x. Determines how a datatype responds to the in operator.

In Python versions through 2.2, the in operator applied to strings has a semantics that tends to trip me up. Fortunately, Python 2.3+ has the behavior that I expect.
In older Python versions, in can only be used to determine the presence of a single character in a string. This makes sense if you think of a string as a sequence of characters, but I nonetheless intuitively want something like the code below to work:

    >>> s = "The cat in the hat"
    >>> if "the" in s: print "Has definite article"
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: 'in <string>' requires character as left operand

It is easy to get the "expected" behavior in a custom string-like datatype (while still always producing the same result whenever x is indeed a character):

    >>> class S(str):
    ...     def __contains__(self, x):
    ...         for i in range(len(self)):
    ...             if self[i:].startswith(x): return 1
    ...
    >>> s = S("The cat in the hat")
    >>> "The" in s
    1

Python 2.3 strings behave the same way as my datatype S.

SEE ALSO: string; operator; tuple;

1.1.5 Exercise: Filling out the forms (or deciding not to)

DISCUSSION

A particular little task that was quite frequent and general before the advent of Web servers has become absolutely ubiquitous for slightly dynamic Web pages. The pattern one encounters is that one has a certain general format that is desired for a document or file, but miscellaneous little details differ from instance to instance. Form letters are another common case where one comes across this pattern, but thematically related collections of Web pages rule the roost of templating techniques.

It turns out that everyone and her sister has developed her own little templating system. Creating a templating system is a very appealing task for users of most scripting languages, just a little while after they have gotten a firm grasp of "Hello World!" Some of these are discussed elsewhere in this book; many others are not addressed.
Often, these templating systems will be HTML/CGI oriented and will include some degree of dynamic calculation of fill-in values. The inspiration in these cases comes from systems like Allaire's ColdFusion, Java Server Pages, Active Server Pages, and PHP, in which some program code gets sprinkled around in documents that are primarily made of HTML.

At the very simplest, Python provides interpolation of special characters in strings, in a style similar to the C sprintf() function. So a simple example might appear like:

    >>> form_letter="""Dear %s %s,
    ... You owe us $%s for account (#%s). Please Pay.
    ... The Company"""
    >>> fname = 'David'
    >>> lname = 'Mertz'
    >>> due = 500
    >>> acct = '123-T745'
    >>> print form_letter % (fname,lname,due,acct)
    Dear David Mertz,
    You owe us $500 for account (#123-T745). Please Pay.
    The Company

This approach does the basic templating, but it would be easy to make an error in composing the tuple of insertion values. Moreover, a slight change to the form letter template, such as the addition or subtraction of a field, would produce wrong results.

A bit more robust approach is to use Python's dictionary-based string interpolation. For example:

    >>> form_letter="""Dear %(fname)s %(lname)s,
    ... You owe us $%(due)s for account (#%(acct)s). Please Pay.
    ... The Company"""
    >>> fields = {'lname':'Mertz', 'fname':'David'}
    >>> fields['acct'] = '123-T745'
    >>> fields['due'] = 500
    >>> fields['last_letter'] = '01/02/2001'
    >>> print form_letter % fields
    Dear David Mertz,
    You owe us $500 for account (#123-T745). Please Pay.
    The Company

With this approach, the fields need not be listed in a particular order for the insertion. Furthermore, if the order of fields is rearranged in the template, or if the same fields are used for a different template, the fields dictionary may still be used for insertion values. If fields has unused dictionary keys, it doesn't hurt the interpolation, either.

The dictionary interpolation approach is still subject to failure if dictionary keys are missing.
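To make that failure mode concrete, here is a minimal sketch in Python 3 syntax (the surrounding examples use Python 2). The helper try_fill() and the shortened one-line template are illustrative assumptions, not code from the text; the only behavior relied on is that the % operator raises KeyError when a named field is absent from the mapping.

```python
# Python 3 sketch; try_fill() is a hypothetical helper, not part of any library.
form_letter = ("Dear %(fname)s %(lname)s, "
               "you owe us $%(due)s for account (#%(acct)s).")

# 'acct' is deliberately left out to trigger the failure
fields = {"fname": "David", "lname": "Mertz", "due": 500}


def try_fill(template, values):
    """Return (text, None) on success, or (None, missing_key) on failure."""
    try:
        return template % values, None
    except KeyError as err:
        # err.args[0] is the first field name that could not be looked up
        return None, err.args[0]


text, missing = try_fill(form_letter, fields)
print(text, missing)

# Once the missing field is supplied, interpolation succeeds
fields["acct"] = "123-T745"
text, missing = try_fill(form_letter, fields)
print(text)
```

The failure is at least trappable: the caller learns which field was missing and can decide whether to abort, prompt for a value, or substitute a default.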
Two improvements using the UserDict module can improve matters, in two different (and incompatible) ways. In Python 2.2+ the built-in dict type can be a parent for a "new-style class"; if available everywhere you need it to run, dict is a better parent than is UserDict.UserDict. One approach is to avoid all key misses during dictionary interpolation:

    >>> form_letter="""%(salutation)s %(fname)s %(lname)s,
    ... You owe us $%(due)s for account (#%(acct)s). Please Pay.
    ... %(closing)s The Company"""
    >>> from UserDict import UserDict
    >>> class AutoFillingDict(UserDict):
    ...     def __init__(self,dict={}): UserDict.__init__(self,dict)
    ...     def __getitem__(self,key):
    ...         return UserDict.get(self, key, '')
    ...
    >>> fields = AutoFillingDict()
    >>> fields['salutation'] = 'Dear'
    >>> fields
    {'salutation': 'Dear'}
    >>> fields['fname'] = 'David'
    >>> fields['due'] = 500
    >>> fields['closing'] = 'Sincerely,'
    >>> print form_letter % fields
    Dear David ,
    You owe us $500 for account (#). Please Pay.
    Sincerely, The Company

Even though the fields lname and acct are not specified, the interpolation has managed to produce a basically sensible letter (instead of crashing with a KeyError).

Another approach is to create a custom dictionary-like object that will allow for "partial interpolation." This approach is particularly useful to gather bits of the information needed for the final string over the course of the program run (rather than all at once):

    >>> form_letter="""%(salutation)s %(fname)s %(lname)s,
    ... You owe us $%(due)s for account (#%(acct)s). Please Pay.
    ... %(closing)s The Company"""
    >>> from UserDict import UserDict
    >>> class ClosureDict(UserDict):
    ...     def __init__(self,dict={}): UserDict.__init__(self,dict)
    ...     def __getitem__(self,key):
    ...         return UserDict.get(self, key, '%('+key+')s')
    ...
    >>> name_dict = ClosureDict({'fname':'David','lname':'Mertz'})
    >>> print form_letter % name_dict
    %(salutation)s David Mertz,
    You owe us $%(due)s for account (#%(acct)s). Please Pay.
    %(closing)s The Company

Interpolating using a ClosureDict simply fills in whatever portion of the information it knows, then returns a new string that is closer to being filled in.

SEE ALSO: dict; UserDict; UserList; UserString;

QUESTIONS

1: What are some other ways to provide "smart" string interpolation? Can you think of ways that the UserList or UserString modules might be used to implement a similar enhanced interpolation?

2: Consider other "magic" methods that you might add to classes inheriting from UserDict.UserDict. How might these additional behaviors make templating techniques more powerful?

3: How far do you think you can go in using Python's string interpolation as a templating technique? At what point would you decide you had to apply other techniques, such as regular expression substitutions or a parser? Why?

4: What sorts of error checking might you implement for customized interpolation? The simple list or dictionary interpolation could fail fairly easily, but at least those were trappable errors (they let the application know something is amiss). How would you create a system with both flexible interpolation and good guards on the quality and completeness of the final result?

1.1.6 Problem: Working with lines from a large file

At its simplest, reading a file in a line-oriented style is just a matter of using the .readline(), .readlines(), and .xreadlines() methods of a file object. Python 2.2+ provides a simplified syntax for this frequent operation by letting the file object itself efficiently iterate over lines (strictly in forward sequence). To read in an entire file, you may use the .read() method and possibly split it into lines or other chunks using the string.split() function. Some examples:

    >>> for line in open('chap1.txt'):  # Python 2.2+
    ...     # process each line in some manner
    ...     pass
    ...
    >>> linelist = open('chap1.txt').readlines()
    >>> print linelist[1849],
      EXERCISE: Working with lines from a large file
    >>> txt = open('chap1.txt').read()
    >>> from os import linesep
    >>> linelist2 = txt.split(linesep)

For moderately sized files, reading the entire contents is not a big issue. But large files make time and memory issues more important. Complex documents or active log files, for example, might be multiple megabytes, or even gigabytes, in size. Even if the contents of such files do not strictly exceed the size of available memory, reading them in full can still be costly. A technique related to those discussed here is covered in the "Problem: Reading a file backwards by record, line, or paragraph" section of a later chapter.

Obviously, if you need to process every line in a file, you have to read the whole file; xreadlines does so in a memory-friendly way, assuming you are able to process the lines sequentially. But for applications that only need a subset of lines in a large file, it is not hard to make improvements. The most important module to look to for support here is linecache.

A CACHED LINE LIST

It is straightforward to read a particular line from a file using linecache:

    >>> import linecache
    >>> print linecache.getline('chap1.txt',1850),
      PROBLEM: Working with lines from a large file

Notice that linecache.getline() uses one-based counting, in contrast to the zero-based list indexing in the prior example. While there is not much to this, it would be even nicer to have an object that combined the efficiency of linecache with the interfaces we expect in lists. Existing code might exist to process lists of lines, or you might want to write a function that is agnostic about the source of a list of lines. In addition to being able to enumerate and index, it would be useful to be able to slice linecache-based objects, just as we might do to real lists (including with extended slices, which were added to lists in Python 2.3).

cachedlinelist.py

    import linecache, types
    class CachedLineList:
        # Note: in Python 2.2+, it is probably worth including:
        #       __slots__ = ('_fname')
        # ...and inheriting from 'object'
        def __init__(self, fname):
            self._fname = fname
        def __getitem__(self, x):
            if type(x) is types.SliceType:
                return [linecache.getline(self._fname, n+1)
                        for n in range(x.start, x.stop, x.step)]
            else:
                return linecache.getline(self._fname, x+1)
        def __getslice__(self, beg, end):
            # pass to __getitem__ which does extended slices also
            return self[beg:end:1]

Using these new objects is almost identical to using a list created by open(fname).readlines(), but more efficient (especially in memory usage):

    >>> from cachedlinelist import CachedLineList
    >>> cl = CachedLineList('./chap1.txt')
    >>> cl[1849]
    '  PROBLEM: Working with lines from a large file\n'
    >>> for line in cl[1849:1851]: print line,
    ...
      PROBLEM: Working with lines from a large file
    >>> for line in cl[1853:1857:2]: print line,
    ...
      matter of using the '.readline()', '.readlines()' and
      simplified syntax for this frequent operation by letting the

A RANDOM LINE

Occasionally, especially for testing purposes, you might want to check "typical" lines in a line-oriented file. It is easy to fall into the trap of making sure that a process works for the first few lines of a file, and maybe for the last few, then assuming it works everywhere. Unfortunately, the first and last few lines of many files tend to be atypical: sometimes headers or footers are used; sometimes a log file's first lines were logged during development rather than usage; and so on. Then again, exhaustive testing of entire files might provide more data than you want to worry about. Depending on the nature of the processing, complete testing could be time consuming as well.

On most systems, seeking to a particular position in a file is far quicker than reading all the bytes up to that position. Even using linecache, you need to read a file byte-by-byte up to the point of a cached line. A fast approach to finding random lines from a large file is to seek to a random position within a file, then read comparatively few bytes before and after that position, identifying a line within that chunk.
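The seek-and-scan idea can be sketched in a few lines of Python 3. This is an illustrative sketch only: the function random_line(), its rng parameter, and the assumptions of "\n" line endings and UTF-8 text are mine rather than the text's, and MAX_LINE_LEN is an assumed upper bound on line length.

```python
import os
import random

MAX_LINE_LEN = 4096  # assumed upper bound on the length of any line


def random_line(path, rng=random):
    """Return the line that contains a randomly chosen byte offset of the file.

    Hypothetical helper: rng may be the random module or a random.Random
    instance (anything with a randrange() method).
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot pick a line from an empty file")
    pos = rng.randrange(size)
    start = max(0, pos - MAX_LINE_LEN)
    with open(path, "rb") as fp:
        fp.seek(start)
        prior = fp.read(pos - start)   # bytes just before the chosen offset
        post = fp.read(MAX_LINE_LEN)   # bytes from the chosen offset onward
    # The line starts after the last newline in prior (rfind gives -1 if none)
    begin = prior.rfind(b"\n") + 1
    # ...and ends at the first newline in post, or at the end of the chunk
    end = post.find(b"\n")
    if end == -1:
        end = len(post)
    return (prior[begin:] + post[:end]).decode("utf-8", errors="replace")
```

Only a bounded window around the chosen offset is read, so the cost of one draw does not grow with the size of the file. Lines longer than MAX_LINE_LEN would come back truncated.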
randline.py

    #!/usr/bin/python
    """Iterate over random lines in a file (req Python 2.2+)

    From command-line use: % randline.py <fname> <numlines>
    """
    import sys
    from os import stat, linesep
    from stat import ST_SIZE
    from random import randrange
    MAX_LINE_LEN = 4096

    #-- Iterable class
    class randline(object):
        __slots__ = ('_fp','_size','_limit')
        def __init__(self, fname, limit=sys.maxint):
            self._size = stat(fname)[ST_SIZE]
            self._fp = open(fname,'rb')
            self._limit = limit
        def __iter__(self):
            return self
        def next(self):
            if self._limit <= 0:
                raise StopIteration
            self._limit -= 1
            pos = randrange(self._size)
            priorlen = min(pos, MAX_LINE_LEN)   # maybe near start
            self._fp.seek(pos-priorlen)
            # Add extra linesep at beg/end in case pos at beg/end
            prior = linesep + self._fp.read(priorlen)
            post = self._fp.read(MAX_LINE_LEN) + linesep
            begln = prior.rfind(linesep) + len(linesep)
            endln = post.find(linesep)
            return prior[begln:]+post[:endln]

    #-- Use as command-line tool
    if __name__=='__main__':
        fname, numlines = sys.argv[1], int(sys.argv[2])
        for line in randline(fname, numlines):
            print line

The presented randline module may be used either imported into another application or as a command-line tool. In the latter case, you could pipe a collection of random lines to another application, as in:

    % randline.py reallybig.log 1000 | testapp

A couple of details should be noted in my implementation. (1) The same line can be chosen more than once in a line iteration. If you choose a small number of lines from a large file, this probably will not happen (but the so-called "birthday paradox" makes an occasional collision more likely than you might expect; see the Glossary). (2) What is selected is "the line that contains a random position in the file," which means that short lines are less likely to be chosen than long lines. That distribution could be a bug or a feature, depending on your needs. In practical terms, for testing "enough" typical cases, the precise distribution is not all that important.

SEE ALSO: xreadlines; linecache; random;

1.2 Standard Modules

There are a variety of tasks that many or most text processing applications will perform, but that are not themselves text processing tasks.
For example, texts typically live inside files, so for a concrete application you might want to check whether files exist, whether you have access to them, and whether they have certain attributes; you might also want to read their contents. The text processing per se does not happen until the text makes it into a Python value, but getting the text into local memory is a necessary step.

Another task is making Python objects persistent so that final or intermediate processing results can be saved in computer-usable forms. Or again, Python applications often benefit from being able to call external processes and possibly work with the results of those calls.

Yet another class of modules helps you deal with Python internals in ways that go beyond what the inherent syntax does. I have made a judgment call in this book as to which such "Python internal" modules are sufficiently general and frequently used in text processing applications; a number of "internal" modules are given only one-line descriptions under the "Other Modules" topic.

1.2.1 Working with the Python Interpreter

Some of the modules in the standard library contain functionality that is nearly as important to Python as the basic syntax. Such modularity is an important strength of Python's design, but users of other languages may be surprised to find capabilities for reading command-line arguments, catching exceptions, copying objects, or the like in external modules.

copy • Generic copying operations

Names in Python programs are merely bindings to underlying objects; many of these objects are mutable. This point is simple, but it winds up biting almost every beginning Python programmer, and even a few experienced Pythoners get caught, too. The problem is that binding another name (including a sequence position, dictionary entry, or attribute) to an object leaves you with two names bound to the same object. If you change the underlying object using one name, the other name also points to a changed object.
Sometimes you want that, sometimes you do not.

One variant of the binding trap is a particularly frequent pitfall. Say you want a 2D table of values, initialized as zeros. Later on, you would like to be able to refer to a row/column position as, for example, table[2][3] (as in many programming languages). Here is what you would probably try first, along with its failure:

    >>> row = [0]*4
    >>> print row
    [0, 0, 0, 0]
    >>> table = [row]*4   # or 'table = [[0]*4]*4'
    >>> for row in table: print row
    ...
    [0, 0, 0, 0]
    [0, 0, 0, 0]
    [0, 0, 0, 0]
    [0, 0, 0, 0]
    >>> table[2][3] = 7
    >>> for row in table: print row
    ...
    [0, 0, 0, 7]
    [0, 0, 0, 7]
    [0, 0, 0, 7]
    [0, 0, 0, 7]
    >>> id(table[2]), id(table[3])
    (6207968, 6207968)

The problem with the example is that table is a list of four positional bindings to the exact same list object. You cannot change just one row, since all four point to just one object. What you need instead is a copy of row to put in each row of table.

Python provides a number of ways to create copies of objects (and bind them to names). Such a copy is a "snapshot" of the state of the object that can be modified independently of changes to the original. A few ways to correct the table problem are:

    >>> table1 = map(list, [(0,)*4]*4)
    >>> id(table1[2]), id(table1[3])
    (6361712, 6361808)
    >>> table2 = [lst[:] for lst in [[0]*4]*4]
    >>> id(table2[2]), id(table2[3])
    (6356720, 6356800)
    >>> from copy import copy
    >>> row = [0]*4
    >>> table3 = map(copy, [row]*4)
    >>> id(table3[2]), id(table3[3])
    (6498640, 6498720)

In general, slices always create new lists. In Python 2.2+, the constructors list() and dict() likewise construct new/copied lists/dicts (possibly using other sequence or association types as arguments).

But the most general way to make a new copy of whatever object you might need is with the copy module. If you use the copy module, you do not need to worry about whether a given sequence is a list or merely list-like (which the list() coercion forces into a list).

FUNCTIONS

copy.copy(obj)

Return a shallow copy of a Python object. Most (but not quite all) types of Python objects can be copied.
A shallow copy binds its elements/members to the same objects as bound in the original, but the object itself is distinct.

    >>> import copy
    >>> class C: pass
    ...
    >>> o1 = C()
    >>> o1.lst = [1,2,3]
    >>> o1.str = 'spam'
    >>> o2 = copy.copy(o1)
    >>> o1.lst.append(17)
    >>> o2.lst
    [1, 2, 3, 17]
    >>> o1.str = 'eggs'
    >>> o2.str
    'spam'

copy.deepcopy(obj)

Return a deep copy of a Python object. Each element or member in an object is itself recursively copied. For nested containers, it is usually more desirable to perform a deep copy; otherwise you can run into problems like the 2D table example above.

    >>> o1 = C()
    >>> o1.lst = [1,2,3]
    >>> o3 = copy.deepcopy(o1)
    >>> o1.lst.append(17)
    >>> o3.lst
    [1, 2, 3]
    >>> o1.lst
    [1, 2, 3, 17]

exceptions • Standard exception class hierarchy

Various actions in Python raise exceptions, and these exceptions can be caught using an except clause. Although strings can serve as exceptions for backwards-compatibility reasons, it is greatly preferable to use class-based exceptions.

When you catch an exception using an except clause, you also catch any descendent exceptions. By utilizing a hierarchy of standard and user-defined exception classes, you can tailor exception handling to meet your specific code requirements.

    >>> class MyException(StandardError): pass
    ...
    >>> try:
    ...     raise MyException
    ... except StandardError:
    ...     print "Caught parent"
    ... except MyException:
    ...     print "Caught specific class"
    ... except:
    ...     print "Caught generic leftover"
    ...
    Caught parent

In general, if you need to raise exceptions manually, you should either use a built-in exception close to your situation, or inherit from that built-in exception. The outline in Figure 1.1 shows the exception classes defined in exceptions.

Figure 1.1. Standard exceptions

    Exception                   Root class for all built-in exceptions
      StandardError             Base for "normal" exceptions
        ArithmeticError         Base for arithmetic exceptions
          OverflowError         Number too large to represent
          ZeroDivisionError     Dividing by zero
          FloatingPointError    Problem in floating point operation
        LookupError             Problem accessing a value in a collection
          IndexError            Problem accessing a value in a sequence
          KeyError              Problem accessing a value in a mapping
        NameError               Problem accessing local or global name
          UnboundLocalError     Reference to non-existent name
        AttributeError          Problem accessing or setting an attribute
        TypeError               Operation or function applied to wrong type
        ValueError              Operation or function on unusable value
          UnicodeError          Problem encoding or decoding
        EnvironmentError        Problem outside of Python itself
          IOError               Problem performing I/O
          OSError               Error passed from the operating system
            WindowsError        Windows-specific OS problem
        AssertionError          Failure of an assert statement
        EOFError                End-of-file without a read
        ImportError             Problem importing a module
        ReferenceError          Problem accessing collected weakref
        KeyboardInterrupt       User pressed interrupt (ctrl-c) key
        MemoryError             Operation runs out of memory (try del'ing)
        SyntaxError             Problem parsing Python code
        SystemError             Internal (recoverable) error in Python
        RuntimeError            Error not falling under any other category
          NotImplementedError   Functionality not yet available
      StopIteration             Iterator has no more items available
      SystemExit                Raised by sys.exit()

getopt • Parse command-line options

Utility applications, whether for text processing or otherwise, frequently accept a variety of command-line switches to configure their behavior. In principle, and frequently in practice, all that you need to do to process command-line options is read through the list sys.argv[1:] and handle each element of the option line. I have certainly written my own small "sys.argv parser" more than once; it is not hard if you do not expect too much.

The getopt module provides some automation and error handling for option parsing.
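As a rough illustration of what that automation and error handling look like, the following Python 3 sketch wraps getopt.getopt() and getopt.GetoptError, both of which exist in the standard library. The wrapper parse_options() and the particular option letters and long names are hypothetical choices made for this example.

```python
import getopt


def parse_options(argv):
    """Parse command-line words (e.g. sys.argv[1:]) into (options, args).

    Hypothetical wrapper: -a and -b are plain flags, -c takes a value,
    --foo takes a value, and --baz is a plain flag.
    """
    try:
        optlist, args = getopt.getopt(argv, "abc:", ["foo=", "baz"])
    except getopt.GetoptError as err:
        # Unknown switches and missing values are reported by getopt itself
        raise SystemExit("bad options: %s" % err)
    # Strip the leading dashes so options can be looked up by bare name
    opts = {name.lstrip("-"): value for name, value in optlist}
    return opts, args


opts, args = parse_options("-a -b -c 2 --foo=bar --baz file1 file2".split())
print(opts)   # {'a': '', 'b': '', 'c': '2', 'foo': 'bar', 'baz': ''}
print(args)   # ['file1', 'file2']
```

Flags that take no value map to an empty string, so a membership test such as 'a' in opts is the natural way to check for them.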
It takes just a few lines of code to tell getopt what options it should handle, and which switch prefixes and parameter styles to use. However, getopt is not necessarily the final word in parsing command lines. Python 2.3 includes Greg Ward's optik module (renamed optparse).

    >>> import getopt
    >>> opts='-a1 -b -c 2 --foo=bar --baz file1 file2'.split()
    >>> optlist, args = getopt.getopt(opts,'a:bc:',['foo=','baz'])
    >>> optlist
    [('-a', '1'), ('-b', ''), ('-c', '2'), ('--foo', 'bar'), ('--baz', '')]
    >>> args
    ['file1', 'file2']

    >>> nodash = lambda s: \
    ...          s.translate(''.join(map(chr,range(256))),'-')
    >>> todict = lambda l: \
    ...          dict([(nodash(opt),val) for opt,val in l])
    >>> optdict = todict(optlist)
    >>> optdict
    {'a': '1', 'c': '2', 'b': '', 'baz': '', 'foo': 'bar'}

You can examine options given either by looping through optlist or by performing optdict.get(key, default) type tests as needed in your program flow.

operator • Standard operations as functions

All of the standard Python syntactic operators are available in functional form using the operator module. In most cases, it is more clear to use the actual operator, but in a few cases functions are useful. The most common usage for operator is in conjunction with functional programming constructs. For example:

    >>> import operator
    >>> lst = [1, 0, (), '', 'abc']
    >>> map(operator.not_, lst)   # fp-style negated bool vals
    [0, 1, 1, 1, 0]
    >>> tmplst = []               # imperative style
    >>> for item in lst:
    ...     tmplst.append(not item)
    ...
    >>> tmplst
    [0, 1, 1, 1, 0]
    >>> del tmplst                # must cleanup stray name

As well as being shorter, I find the FP style more clear. The source code below provides sample implementations of the functions in the operator module. The actual implementations are faster and are written directly in C, but the samples illustrate what each function does.

operator2.py

    ### Comparison functions
    lt = __lt__ = lambda a,b: a < b
    le = __le__ = lambda a,b: a <= b
    eq = __eq__ = lambda a,b: a == b
    ne = __ne__ = lambda a,b: a != b
    ge = __ge__ = lambda a,b: a >= b
    gt = __gt__ = lambda a,b: a > b
    ### Boolean functions
    not_ = __not__ = lambda o: not o
    truth = lambda o: not not o
    # Arithmetic functions
    abs = __abs__ = abs   # same as built-in function
    add = __add__ = lambda a,b: a + b
    and_ = __and__ = lambda a,b: a & b  # bitwise, not boolean
    div = __div__ = \
          lambda a,b: a/b  # depends on __future__.division

    % ./cap_file.py nosuchfile > CAPS
    Could not read 'nosuchfile'
    % ./cap_file.py > CAPS
    No filename specified

SEE ALSO: sys.argv; sys.stdin; sys.stdout;

sys.stdin
sys.__stdin__

File object for standard input stream (STDIN). sys.__stdin__ retains the original value in case sys.stdin is modified during program execution. input() and raw_input() are read from sys.stdin, but the most typical use of sys.stdin is for piped and redirected streams on the command line. For example:

    % cat cap_stdin.py
    #!/usr/bin/env python
    import sys, string
    input = sys.stdin.read()
    print string.upper(input)
    % echo "this and that" | ./cap_stdin.py
    THIS AND THAT

SEE ALSO: sys.argv; sys.stderr; sys.stdout;

sys.stdout
sys.__stdout__

File object for standard output stream (STDOUT). sys.__stdout__ retains the original value in case sys.stdout is modified during program execution. The formatted output of the print statement goes to sys.stdout, and you may also use regular file methods, such as sys.stdout.write().

SEE ALSO: sys.argv; sys.stderr; sys.stdin;

sys.version

A string containing version information on the current Python interpreter. The form of the string is version (#build_num, build_date, build_time) [compiler].
For example:

    >>> print sys.version
    1.5.2 (#0, Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)]

or:

    >>> print sys.version
    2.2 (#1, Apr 17 2002, 16:11:12)
    [GCC 2.95.2 19991024 (release)]

This version-independent way to find the major, minor, and micro version components should work for 1.5-2.3.x (at least):

    >>> from string import split
    >>> from sys import version
    >>> ver_tup = map(int, split(split(version)[0],'.'))+[0]
    >>> major, minor, point = ver_tup[:3]
    >>> if (major, minor) >= (1, 6):
    ...     print "New Way"
    ... else:
    ...     print "Old Way"
    ...
    New Way

sys.version_info

A 5-tuple containing five components of the version number of the current Python interpreter: (major, minor, micro, releaselevel, serial). releaselevel is a descriptive phrase; the others are integers.

    >>> sys.version_info
    (2, 2, 0, 'final', 0)

Unfortunately, this attribute was only added in Python 2.0, so its items are not entirely useful in requiring a minimal version for some desired functionality.

SEE ALSO: sys.version;

FUNCTIONS

sys.exit([code=0])

Exit Python with exit code code. Cleanup actions specified by finally clauses of try statements are honored, and it is possible to intercept the exit attempt by catching the SystemExit exception. You may specify a numeric exit code for those systems that codify them; you may also specify a string exit code, which is printed to STDERR (with the actual exit code set to 1).

sys.getdefaultencoding()

Return the name of the default Unicode string encoding in Python 2.0+.

sys.getrefcount(obj)

Return the number of references to the object obj. The value returned is one higher than you might expect, because it includes the (temporary) reference passed as the argument.

    >>> x = y = "hi there"
    >>> import sys
    >>> sys.getrefcount(x)
    3
    >>> lst = [x, x, x]
    >>> sys.getrefcount(x)
    6

SEE ALSO: os;

types • Standard Python object types

Every object in Python has a type; you can find it by using the built-in function type(). Often Python functions use a sort of ad hoc overloading, which is implemented by checking features of objects passed as arguments.
Programmers coming from languages like C or Java are sometimes surprised by this style, since they are accustomed to seeing multiple "type signatures" for each set of argument types the function can accept. But that is not the Python way.

Experienced Python programmers try not to rely on the precise types of objects, not even in an inheritance sense. This attitude is also sometimes surprising to programmers of other languages (especially statically typed ones). What is usually important to a Python program is what an object can do, not what it is. In fact, it has become much more complicated to describe what many objects are with the "type/class unification" in Python 2.2 and above (the details are outside the scope of this book).

For example, you might be inclined to write an overloaded function in the following manner:

Naive overloading of argument

    import types, exceptions
    def overloaded_get_text(o):
        if type(o) is types.FileType:
            text = o.read()
        elif type(o) is types.StringType:
            text = o
        elif type(o) in (types.IntType, types.FloatType,
                         types.LongType, types.ComplexType):
            text = repr(o)
        else:
            raise exceptions.TypeError
        return text

The problem with this rigidly typed code is that it is far more fragile than is necessary. Something need not be an actual FileType to read its text; it just needs to be sufficiently "file-like" (e.g., a urllib.urlopen() or cStringIO.StringIO() object is file-like enough for this purpose). Similarly, a new-style object that descends from types.StringType or a UserString.UserString() object is "string-like" enough to return as such, and similarly for other numeric types.

A better implementation of the function above is:

"Quacks like a duck" overloading of argument

    def overloaded_get_text(o):
        if hasattr(o,'read'):
            return o.read()
        try:
            return ""+o
        except TypeError:
            pass
        try:
            return repr(0+o)
        except TypeError:
            pass

At times, nonetheless, it is useful to have symbolic names available to name specific object types.
In many such cases, an empty or minimal version of the type of object may be used in conjunction with the type() function equally well; the choice is mostly stylistic:

    >>> type('') == types.StringType
    1
    >>> type(0.0) == types.FloatType
    1
    >>> type(None) == types.NoneType
    1

BUILT-IN

type(o)

Return the datatype of any object o. The return value of this function is itself an object of the type types.TypeType. TypeType objects implement .__str__() and .__repr__() methods to create readable descriptions of object types.

    >>> print type(1)
    <type 'int'>
    >>> print type(type(1))
    <type 'type'>
    >>> type(1) is type(0)
    1

CONSTANTS

types.BuiltinFunctionType
types.BuiltinMethodType

The type for built-in functions like abs(), len(), and dir(), and for functions in "standard" C extensions like sys and os. However, extensions like string and re are actually Python wrappers for C extensions, so their functions are of type types.FunctionType. A general Python programmer need not worry about these fussy details.

types.BufferType

The type for objects created by the built-in buffer() function.

types.ClassType

The type for user-defined classes.

    >>> from operator import eq
    >>> from types import *
    >>> map(eq, [type(C), type(C()), type(C().foo)],
    ...         [ClassType, InstanceType, MethodType])
    [1, 1, 1]

SEE ALSO: types.InstanceType; types.MethodType;

types.CodeType

The type for code objects such as returned by compile().

types.ComplexType

Same as type(0+0j).

types.DictType
types.DictionaryType

Same as type({}).

types.EllipsisType

The type for the built-in Ellipsis object.

types.FileType

The type for open file objects.

    >>> from sys import stdout
    >>> fp = open('tst','w')
    >>> [type(stdout), type(fp)] == [types.FileType]*2
    1

types.FloatType

Same as type(0.0).

types.FrameType

The type for frame objects such as tb.tb_frame, in which tb has the type types.TracebackType.

types.FunctionType
types.LambdaType

Same as type(lambda:0).

types.GeneratorType

The type for generator-iterator objects in Python 2.2+.

    >>> from __future__ import generators
    >>> def foo(): yield 0
    ...
    >>> type(foo) == types.FunctionType
    1
    >>> type(foo()) == types.GeneratorType
    1

SEE ALSO: types.FunctionType;

types.InstanceType

The type for instances of user-defined classes.

SEE ALSO: types.ClassType; types.MethodType;

types.IntType

Same as type(0).

types.ListType

Same as type([]).

types.LongType

Same as type(0L).

types.MethodType
types.UnboundMethodType

The type for methods of user-defined class instances.

SEE ALSO: types.ClassType; types.InstanceType;

types.ModuleType

The type for modules.

    >>> import os, re, sys
    >>> [type(os), type(re), type(sys)] == [types.ModuleType]*3
    1

types.NoneType

Same as type(None).

types.StringType

Same as type('').

types.TracebackType

The type for traceback objects found in sys.exc_traceback.

types.TupleType

Same as type(()).

types.UnicodeType

Same as type(u'').

types.SliceType

The type for objects returned by slice().

types.StringTypes

Same as (types.StringType, types.UnicodeType).

SEE ALSO: types.StringType; types.UnicodeType;

types.TypeType

Same as type(type(obj)) (for any obj).

types.XRangeType

Same as type(xrange(1)).

1.2.2 Working with the Local Filesystem

dircache • Read and cache directory listings

The dircache module is an enhanced version of the os.listdir() function. Unlike the os function, dircache keeps prior directory listings in memory to avoid the need for a new call to the filesystem. Since dircache is smart enough to check whether a directory has been touched since last caching, dircache is a complete replacement for os.listdir() (with possible minor speed gains).

FUNCTIONS

dircache.listdir(path)

Return a directory listing of path path. Uses a list cached in memory where possible.

dircache.opendir(path)

Identical to dircache.listdir(). Legacy function to support old scripts.

dircache.annotate(path, lst)

Modify the list lst in place to indicate which items are directories, and which are plain files. The string path should indicate the path to reach the listed files.

    >>> l = dircache.listdir('/tmp')
    >>> l
    ['501', 'md10834.db']
    >>> dircache.annotate('/tmp', l)
    >>> l
    ['501/', 'md10834.db']

filecmp - Compare files and directories

The filecmp module lets you check whether two files are identical, and whether two directories contain some identical files. You have several options in determining how thorough a comparison is performed.

FUNCTIONS

filecmp.cmp(fname1, fname2 [,shallow=1 [,use_statcache=0]])
Compare the file named by the string fname1 with the file named by the string fname2. If the default true value of shallow is used, the comparison is based only on the mode, size, and modification time of the two files. If shallow is a false value, the files are compared byte by byte. Unless you are concerned that someone will deliberately falsify timestamps on files (as in a cryptography context), a shallow comparison is quite reliable. However, tar and untar can also change timestamps.

    >>> import filecmp
    >>> filecmp.cmp('dir1/file1', 'dir2/file1')
    0
    >>> filecmp.cmp('dir1/file2', 'dir2/file2', shallow=0)
    1

The use_statcache argument is not relevant for Python 2.2+. In older Python versions, the statcache module provided (slightly) more efficient cached access to file stats, but its use is no longer needed.

filecmp.cmpfiles(dirname1, dirname2, fnamelist [,shallow=1 [,use_statcache=0]])
Compare those filenames listed in fnamelist if they occur in both the directory dirname1 and the directory dirname2. filecmp.cmpfiles() returns a tuple of three lists (some of the lists may be empty): (matches, mismatches, errors). matches are identical files in both directories; mismatches are files that differ between the two; errors are files that could not be compared, for example because they are missing from one directory.

    >>> import filecmp, os
    >>> filecmp.cmpfiles('dir1','dir2',['this','that','other'])
    (['this'], ['that'], ['other'])
    >>> print os.popen('ls -l dir1').read()
    -rwxr-xr-x 1 quilty staff 169 Sep 27 00:13 this
    -rwxr-xr-x 1 quilty staff 687 Sep 27 00:18 that
    -rwxr-xr-x 1 quilty staff 737 Sep 27 00:16 other
    -rwxr-xr-x 1 quilty staff 518 Sep 12 11:57 spam
    >>> print os.popen('ls -l dir2').read()
    -rwxr-xr-x 1 quilty staff 169 Sep 27 00:13 this
    -rwxr-xr-x 1 quilty staff 692 Sep 27 00:32 that

The shallow and use_statcache arguments are the same as those to filecmp.cmp().

CLASSES

filecmp.dircmp(dirname1, dirname2 [,ignore=... [,hide=...]])
Create a directory comparison object. dirname1 and dirname2 are two directories to compare. The optional argument ignore is a sequence of pathnames to ignore and defaults to ["RCS","CVS","tags"]; hide is a sequence of pathnames to hide and defaults to [os.curdir, os.pardir] (i.e., [".", ".."]).

METHODS AND ATTRIBUTES

The attributes of filecmp.dircmp are read-only. Do not attempt to modify them.

filecmp.dircmp.report()
Print a comparison report on the two directories.

    >>> mycmp = filecmp.dircmp('dir1','dir2')
    >>> mycmp.report()
    diff dir1 dir2
    Only in dir1 : ['other', 'spam']
    Identical files : ['this']
    Differing files : ['that']

filecmp.dircmp.report_partial_closure()
Print a comparison report on the two directories, including immediate subdirectories. The method name has nothing to do with the theoretical term "closure" from functional programming.

filecmp.dircmp.report_full_closure()
Print a comparison report on the two directories, recursively including all nested subdirectories.

filecmp.dircmp.left_list
Pathnames in the dirname1 directory, filtering out the hide and ignore lists.

filecmp.dircmp.right_list
Pathnames in the dirname2 directory, filtering out the hide and ignore lists.

filecmp.dircmp.common
Pathnames in both directories.

filecmp.dircmp.left_only
Pathnames in dirname1 but not dirname2.

filecmp.dircmp.right_only
Pathnames in dirname2 but not dirname1.

filecmp.dircmp.common_dirs
Subdirectories in both directories.

filecmp.dircmp.common_files
Filenames in both directories.

filecmp.dircmp.common_funny
Pathnames in both directories, but of different types.

filecmp.dircmp.same_files
Filenames of identical files in both directories.

filecmp.dircmp.diff_files
Filenames of nonidentical files whose name occurs in both directories.

filecmp.dircmp.funny_files
Filenames in both directories where something goes wrong during comparison.

filecmp.dircmp.subdirs
A dictionary mapping filecmp.dircmp.common_dirs strings to corresponding filecmp.dircmp objects; for example:

    >>> usercmp = filecmp.dircmp('/Users/quilty','/Users/dqm')
    >>> usercmp.subdirs['Public'].common
    ['Drop Box']

SEE ALSO: os.stat(); os.listdir()

fileinput - Read multiple files or STDIN

Many utilities, especially on Unix-like systems, operate line-by-line on one or more files and/or on redirected input. Flexibility in treating input sources in a homogeneous fashion is part of the "Unix philosophy." The fileinput module allows you to write a Python application that uses these common conventions with almost no special programming to adjust to input sources.

A common, minimal, but extremely useful Unix utility is cat, which simply writes its input to STDOUT (allowing redirection of STDOUT as needed). Below are a few simple examples of cat:

    % cat a
    AAAAA
    % cat a b
    AAAAA
    BBBBB
    % cat - b < a
    AAAAA
    BBBBB
    % cat [...]

[...]

    >>> import glob, os.path
    >>> glob.glob('/Users/quilty/Book/chap[3-4].txt')
    ['/Users/quilty/Book/chap3.txt', '/Users/quilty/Book/chap4.txt']
    >>> glob.glob('chap[3-6].txt')
    ['chap3.txt', 'chap4.txt', 'chap5.txt', 'chap6.txt']
    >>> filter(os.path.isdir, glob.glob('/Users/quilty/Book/[A-Z]*'))
    ['/Users/quilty/Book/SCRIPTS', '/Users/quilty/Book/XML']

SEE ALSO: fnmatch; os.path

linecache - Cache lines from files

The module linecache can be used to simulate relatively efficient random access to the lines in a file. Lines that are read are cached for later access.

FUNCTIONS

linecache.getline(fname, linenum)
Read line linenum from the file named fname.
If an error occurs reading the line, the function will catch the error and return an empty string. sys.path is also searched for the filename if it is not found in the current directory.

    >>> import linecache
    >>> linecache.getline('/etc/hosts', 15)
    '192.168.1.108 hermes hermes.gnosis.lan\n'

linecache.clearcache()
Clear the cache of read lines.

linecache.checkcache()
Check whether files in the cache have been modified since they were cached.

os.path - Common pathname manipulations

The os.path module provides a variety of functions to analyze and manipulate filesystem paths in a cross-platform fashion.

FUNCTIONS

os.path.abspath(pathname)
Return an absolute path for a (relative) pathname.

    >>> os.path.abspath('SCRIPTS/mk_book')
    '/Users/quilty/Book/SCRIPTS/mk_book'

os.path.basename(pathname)
Same as os.path.split(pathname)[1].

os.path.commonprefix(pathlist)
Return the longest leading string shared by all elements of the sequence pathlist, which is often the most nested parent directory they have in common. The comparison is made character by character, so the result is not guaranteed to end on a directory boundary.

    >>> os.path.commonprefix(['/usr/X11R6/bin/twm',
    ...                       '/usr/sbin/bash',
    ...                       '/usr/local/bin/dada'])
    '/usr/'

os.path.dirname(pathname)
Same as os.path.split(pathname)[0].

os.path.exists(pathname)
Return true if the pathname pathname exists.

os.path.expanduser(pathname)
Expand pathnames that include the tilde character: ~. Under standard Unix shells, an initial tilde refers to a user's home directory, and a tilde followed by a name refers to the named user's home directory. This function emulates that behavior on other platforms.

    >>> os.path.expanduser('~dqm')
    '/Users/dqm'
    >>> os.path.expanduser('~/Book')
    '/Users/quilty/Book'

os.path.expandvars(pathname)
Expand pathname by replacing environment variables in a Unix shell style. While this function is in the os.path module, you could equally use it for bash-like scripting in Python generally (this is not necessarily a good idea, but it is possible).
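As a rough illustration of the shell-style expansion just described, the sketch below sets a throwaway environment variable and expands it. It assumes a modern Python 3 interpreter rather than the Python 2 releases this chapter documents, and the variable names TPIP_DEMO_HOME and TPIP_NO_SUCH_VARIABLE are hypothetical, chosen only for the demonstration.

```python
import os
import os.path

# TPIP_DEMO_HOME is a made-up variable used only for this illustration.
os.environ["TPIP_DEMO_HOME"] = "/Users/quilty"

# Both the $NAME and ${NAME} forms are expanded in the Unix shell style.
expanded = os.path.expandvars("$TPIP_DEMO_HOME/Book")
braced = os.path.expandvars("${TPIP_DEMO_HOME}/Book/SCRIPTS")

# A reference to an undefined variable is left in place, not blanked out.
untouched = os.path.expandvars("$TPIP_NO_SUCH_VARIABLE/Book")

print(expanded)
print(braced)
print(untouched)
```

Leaving undefined references untouched differs from most shells, which substitute an empty string, so a result that still contains a dollar sign is worth checking for before using it as a path.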

    >>> os.path.expandvars('$HOME/Book')
    '/Users/quilty/Book'
    >>> from os.path import expandvars as ev  # Python 2.0+
    >>> if ev('$HOSTTYPE')=='macintosh' and ev('$OSTYPE')=='darwin':
    ...     print ev("The vendor is $VENDOR, the CPU is $MACHTYPE")
    ...
    The vendor is apple, the CPU is powerpc

os.path.getatime(pathname)
Return the last access time of pathname (or raise os.error if checking is not possible).

os.path.getmtime(pathname)
Return the modification time of pathname (or raise os.error if checking is not possible).

os.path.getsize(pathname)
Return the size of pathname in bytes (or raise os.error if checking is not possible).

os.path.isabs(pathname)
Return true if pathname is an absolute path.

os.path.isdir(pathname)
Return true if pathname is a directory.

os.path.isfile(pathname)
Return true if pathname is a regular file (including symbolic links to regular files).

os.path.islink(pathname)
Return true if pathname is a symbolic link.

os.path.ismount(pathname)
Return true if pathname is a mount point (on POSIX systems).

os.path.join(path1 [,path2 [...]])
Join multiple path components intelligently.

    >>> os.path.join('/Users/quilty/','Book','SCRIPTS/','mk_book')
    '/Users/quilty/Book/SCRIPTS/mk_book'

os.path.normcase(pathname)
Convert pathname to canonical lowercase on case-insensitive filesystems. Also convert slashes on Windows systems.

os.path.normpath(pathname)
Remove redundant path information.

    >>> os.path.normpath('/usr/local/bin/../include/./slang.h')
    '/usr/local/include/slang.h'

os.path.realpath(pathname)
Return the "real" path to pathname after de-aliasing any symbolic links. New in Python 2.2+.

    >>> os.path.realpath('/usr/bin/newaliases')
    '/usr/sbin/sendmail'

os.path.samefile(pathname1, pathname2)
Return true if pathname1 and pathname2 are the same file.

SEE ALSO: filecmp

os.path.sameopenfile(fp1, fp2)
Return true if the file handles fp1 and fp2 refer to the same file. Not available on Windows.

os.path.split(pathname)
Return a tuple containing the path leading up to the named pathname and the named directory or filename in isolation.

    >>> os.path.split('/Users/quilty/Book/SCRIPTS')
    ('/Users/quilty/Book', 'SCRIPTS')

os.path.splitdrive(pathname)
Return a tuple containing the drive letter and the rest of the path. On systems that do not use a drive letter, the drive letter is empty (as it is where none is specified on Windows-like systems).

os.path.walk(pathname, visitfunc, arg)
For every directory recursively contained in pathname, call visitfunc(arg, dirname, pathnames) for each path.

    >>> def big_files(minsize, dirname, files):
    ...     for file in files:
    ...         fullname = os.path.join(dirname, file)
    ...         if os.path.isfile(fullname):
    ...             if os.path.getsize(fullname) >= minsize:
    ...                 print fullname
    ...
    >>> os.path.walk('/usr/', big_files, 5e6)
    /usr/lib/libSystem.B_debug.dylib
    /usr/lib/libSystem.B_profile.dylib

shutil - Copy files and directory trees

The functions in the shutil module make working with files a bit easier. There is nothing in this module that you could not do using basic file objects and os.path functions, but shutil often provides a more direct means and handles minor details for you. The functions in shutil match fairly closely the capabilities you would find in Unix filesystem utilities like cp and rm.

FUNCTIONS

shutil.copy(src, dst)
Copy the file named src to the pathname dst. If dst is a directory, the created file is given the name os.path.join(dst, os.path.basename(src)).

SEE ALSO: os.path.join(); os.path.basename()

shutil.copy2(src, dst)
Same as shutil.copy() except that the access and modification times of dst are set to the values in src.

shutil.copyfile(src, dst)
Copy the file named src to the filename dst (overwriting dst if present). Basically, this has the same effect as open(dst,"wb").write(open(src,"rb").read()).

shutil.copyfileobj(fpsrc, fpdst [,buffer=-1])
Copy the file-like object fpsrc to the file-like object fpdst. If the optional argument buffer is given, only the specified number of bytes are read into memory at a time; this allows copying very large files.
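The chunked copying described for shutil.copyfileobj() can be sketched as follows. This is a minimal illustration that assumes a modern Python 3 interpreter, where the third argument is passed positionally as a chunk length; the in-memory io.BytesIO objects and the 4096-byte chunk size are stand-ins chosen for the example, not anything the text prescribes.

```python
import io
import shutil

# An in-memory "file" stands in for a large source file on disk.
source = io.BytesIO(b"0123456789" * 1000)
destination = io.BytesIO()

# Copy in 4096-byte chunks, so only one chunk is held in memory at a time.
# The chunk size is an arbitrary choice for this sketch.
shutil.copyfileobj(source, destination, 4096)

copied = destination.getvalue()
print(len(copied))
```

Because the function only needs objects with read() and write() methods, the same call works for open files, sockets wrapped as files, or other file-like objects.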

shutil.copymode(src, dst)
Copy the permission bits from the file named src to the filename dst.

shutil.copystat(src, dst)
Copy the permission and timestamp data from the file named src to the filename dst.

shutil.copytree(src, dst [,symlinks])
Copy the directory src to the destination dst recursively. If the optional argument symlinks is a true value, copy symbolic links as links rather than the default behavior of copying the content of the link target. This function may not be entirely reliable on every platform and filesystem.

shutil.rmtree(dirname [,ignore [,errorhandler]])
Remove an entire directory tree rooted at dirname. If the optional argument ignore is a true value, errors will be silently ignored. If errorhandler is given, a custom error handler is used to catch errors. This function may not be entirely reliable on every platform and filesystem.

SEE ALSO: open(); os.path

stat - Constants/functions for os.stat()

The stat module provides two types of support for analyzing the results of os.stat(), os.lstat(), and os.fstat() calls.

Several functions exist to allow you to perform tests on a file. If you simply wish to check one predicate of a file, it is more direct to use one of the os.path.is*() functions, but for performing several such tests, it is faster to read the mode once and perform several stat.S_*() tests.

As well as helper functions, stat defines symbolic constants to access the fields of the 10-tuple returned by os.stat() and friends. For example:

    >>> from stat import *
    >>> import os
    >>> fileinfo = os.stat('chap1.txt')
    >>> fileinfo[ST_SIZE]
    8686L
    >>> mode = fileinfo[ST_MODE]
    >>> S_ISSOCK(mode)
    0
    >>> S_ISDIR(mode)
    0
    >>> S_ISREG(mode)
    1

FUNCTIONS

stat.S_ISDIR(mode)
Mode indicates a directory.

stat.S_ISCHR(mode)
Mode indicates a character special device file.

stat.S_ISBLK(mode)
Mode indicates a block special device file.

stat.S_ISREG(mode)
Mode indicates a regular file.

stat.S_ISFIFO(mode)
Mode indicates a FIFO (named pipe).

stat.S_ISLNK(mode)
Mode indicates a symbolic link.

stat.S_ISSOCK(mode)
Mode indicates a socket.

CONSTANTS

stat.ST_MODE
I-node protection mode.

stat.ST_INO
I-node number.

stat.ST_DEV
Device.

stat.ST_NLINK
Number of links to this i-node.

stat.ST_UID
User id of file owner.

stat.ST_GID
Group id of file owner.

stat.ST_SIZE
Size of file.

stat.ST_ATIME
Last access time.

stat.ST_MTIME
Modification time.

stat.ST_CTIME
Time of last status change.

tempfile - Temporary files and filenames

The tempfile module is useful when you need to store transient data using a file-like interface. In contrast to the file-like interface of StringIO, tempfile uses the actual filesystem for storage rather than simulating the interface to a file in memory. In memory-constrained contexts, therefore, tempfile is preferable.

The temporary files created by tempfile are as secure against external modification as is supported by the underlying platform. You can be fairly confident that your temporary data will not be read or changed either while your program is running or afterwards (temporary files are deleted when closed). While you should not count on tempfile to provide you with cryptographic-level security, it is good enough to prevent accidents and casual inspection.

FUNCTIONS

tempfile.mktemp([suffix])
Return an absolute path to a unique temporary filename. If the optional argument suffix is specified, the name will end with the suffix string.

tempfile.TemporaryFile([mode="w+b" [,buffsize=-1 [,suffix]]])
Return a temporary file object. In general, there is little reason to change the default mode argument of w+b; there is no existing file to append to before the creation, and it does little good to write temporary data you cannot read. Likewise, the optional suffix argument generally will not ever be visible, since the file is deleted when closed.
The default buffsize uses the platform defaults, but may be modified if needed.

    >>> tmpfp = tempfile.TemporaryFile()
    >>> tmpfp.write('this and that\n')
    >>> tmpfp.write('something else\n')
    >>> tmpfp.tell()
    29L
    >>> tmpfp.seek(0)
    >>> tmpfp.read()
    'this and that\nsomething else\n'

SEE ALSO: StringIO; cStringIO

xreadlines - Efficient iteration over a file

Reading over the lines of a file had some pitfalls in older versions of Python: There was a memory-friendly way, and there was a fast way, but never the twain shall meet. These techniques were:

    >>> fp = open('bigfile')
    >>> line = fp.readline()
    >>> while line:
    ...     # Memory-friendly but slow
    ...     # ...do stuff...
    ...     line = fp.readline()
    ...
    >>> for line in open('bigfile').readlines():
    ...     # Fast but memory-hungry
    ...     # ...do stuff...
    ...

Fortunately, with Python 2.1 a more efficient technique was provided. In Python 2.2+, this efficient technique was also wrapped into a more elegant syntactic form (in keeping with the new iterator). With Python 2.3+, xreadlines is officially deprecated in favor of the idiom "for line in file:".

FUNCTIONS

xreadlines.xreadlines(fp)
Iterate over the lines of file object fp in an efficient way (both speed-wise and in memory usage).

    >>> import xreadlines
    >>> for line in xreadlines.xreadlines(open('tmp')):
    ...     # Efficient all around
    ...     # ...do stuff...
    ...

Corresponding to this xreadlines module function is the .xreadlines() method of file objects:

    >>> for line in open('tmp').xreadlines():
    ...     # As a file object method
    ...     # ...do stuff...
    ...

If you use Python 2.2 or above, an even nicer version is available:

    >>> for line in open('tmp'):
    ...     # ...do stuff...
    ...

SEE ALSO: linecache; FILE.xreadlines(); os.tmpfile()

1.2.3 Running External Commands and Accessing OS Features

commands - Quick access to external commands

The commands module exists primarily as a convenience wrapper for calls to os.popen*() functions on Unix-like systems. STDERR is combined with STDOUT in the results.

FUNCTIONS

commands.getoutput(cmd)
Return the output from running cmd.
This function could also be implemented as:

    >>> def getoutput(cmd):
    ...     import os
    ...     return os.popen('{ '+cmd+'; } 2>&1').read()
    ...

commands.getstatusoutput(cmd)
Return a tuple containing the exit status and output from running cmd. This function could also be implemented as:

    >>> def getstatusoutput(cmd):
    ...     import os
    ...     fp = os.popen('{ '+cmd+'; } 2>&1')
    ...     output = fp.read()
    ...     status = fp.close()
    ...     if not status: status=0  # Want zero rather than None
    ...     return (status, output)
    ...
    >>> getstatusoutput('ls nosuchfile')
    (256, 'ls: nosuchfile: No such file or directory\n')
    >>> getstatusoutput('ls chap[1-3].txt')
    (0, 'chap1.txt\nchap2.txt\nchap3.txt\n')

commands.getstatus(filename)
Same as commands.getoutput('ls -ld '+filename).

SEE ALSO: os.popen(); os.popen2(); os.popen3(); os.popen4()

os - Portable operating system services

The os module contains a large number of functions, attributes, and constants for calling on or determining features of the operating system that Python runs on. In many cases, functions in os are internally implemented using modules like posix, os2, riscos, or mac, but for portability it is better to use the os module.

Not everything in the os module is documented in this book. You can read about those features that are unlikely to be used in text processing applications in the Python Library Reference that accompanies Python distributions.

Functions and constants not documented here fall into several categories. The functions and attributes os.confstr(), os.confstr_names, os.sysconf(), and os.sysconf_names let you probe system configuration. As well, I skip some functions specific to process permissions on Unix-like systems: os.ctermid(), os.getegid(), os.geteuid(), os.getgid(), os.getgroups(), os.getlogin(), os.getpgrp(), os.getppid(), os.getuid(), os.setegid(), os.seteuid(), os.setgid(), os.setgroups(), os.setpgrp(), os.setpgid(), os.setreuid(), os.setregid(), os.setsid(), and os.setuid(uid).

The functions os.abort(), os.exec*(), os._exit(), os.fork(), os.forkpty(), os.plock(), os.spawn*(), os.times(), os.wait(), os.waitpid(), os.WIF*(), os.WEXITSTATUS(), os.WSTOPSIG(), and os.WTERMSIG() and the constants os.P_* and os.WNOHANG all deal with process creation and management. These are not documented in this book, since creating and managing multiple processes is not typically central to text processing tasks. However, I briefly document the basic capabilities in os.kill(), os.nice(), os.startfile(), and os.system() and in the os.popen() family. Some of the omitted functionality can also be found in the commands and sys modules.

A number of functions in the os module allow you to perform low-level I/O using file descriptors. In general, it is simpler to perform I/O using file objects created with the built-in open() function or the os.popen*() family. These file objects provide methods like FILE.readline(), FILE.write(), FILE.seek(), and FILE.close(). Information about files can be determined using the os.stat() function or functions in the os.path and shutil modules. Therefore, the functions os.close(), os.dup(), os.dup2(), os.fpathconf(), os.fstat(), os.fstatvfs(), os.ftruncate(), os.isatty(), os.lseek(), os.open(), os.openpty(), os.pathconf(), os.pipe(), os.read(), os.statvfs(), os.tcgetpgrp(), os.tcsetpgrp(), os.ttyname(), os.umask(), and os.write() are not covered here. As well, the supporting constants os.O_* and os.pathconf_names are omitted.

SEE ALSO: commands; os.path; shutil; sys

FUNCTIONS

os.access(pathname, operation)
Check the permission for the file or directory pathname. If the type of operation specified is allowed, return a true value. The argument operation is a number between 0 and 7, inclusive, and encodes four features: exists, executable, writable, and readable. These features have symbolic names:

    >>> import os
    >>> os.F_OK, os.X_OK, os.W_OK, os.R_OK
    (0, 1, 2, 4)

To query a specific combination of features, you may add or bitwise-or the individual features.

    >>> os.access('myfile', os.W_OK | os.R_OK)
    1
    >>> os.access('myfile', os.X_OK + os.R_OK)
    0
    >>> os.access('myfile', 6)
    1

os.chdir(pathname)
Change the current working directory to the path pathname.
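Returning briefly to os.access(): the sketch below shows one way the permission flags might be combined against a scratch file. It assumes a modern Python 3 interpreter on a Unix-like system; the scratch file comes from tempfile.mkstemp(), and the names exists, read_write, executable, and gone are introduced here purely for illustration.

```python
import os
import stat
import tempfile

# Create a scratch file that is readable and writable by its owner only.
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

exists = os.access(path, os.F_OK)                # does the path exist?
read_write = os.access(path, os.R_OK | os.W_OK)  # both flags must hold
executable = os.access(path, os.X_OK)            # platform and user dependent

# After removal, even the bare existence test fails.
os.remove(path)
gone = os.access(path, os.F_OK)

print(exists, read_write, executable, gone)
```

A combined query succeeds only when every requested feature is allowed, which is why bitwise-or and addition of distinct flags give the same answer.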

SEE ALSO: os.getcwd()

os.chmod(pathname, mode)
Change the mode of file or directory pathname to numeric mode mode. See the man page for the chmod utility for more information on modes.

os.chown(pathname, uid, gid)
Change the owner and group of file or directory pathname to uid and gid respectively. See the man page for the chown utility for more information.

os.chroot(pathname)
Change the root directory under Unix-like systems (on Python 2.2+). See the man page for the chroot utility for more information.

os.getcwd()
Return the current working directory as a string.

    >>> os.getcwd()
    '/Users/quilty/Book'

SEE ALSO: os.chdir()

os.getenv(var [,value=None])
Return the value of environment variable var. If the environment variable is not defined, return value. An equivalent call is os.environ.get(var, value).

SEE ALSO: os.environ; os.putenv()

os.getpid()
Return the current process id. Possibly useful for calls to external utilities that use process ids.

SEE ALSO: os.kill()

os.kill(pid, sig)
Kill an external process on Unix-like systems. You will need to determine values for the pid argument by some means, such as a call to the ps utility. Values for the signal sig sent to the process may be found in the signal module or with man signal. For example:

    >>> from signal import *
    >>> SIGHUP, SIGINT, SIGQUIT, SIGIOT, SIGKILL
    (1, 2, 3, 6, 9)
    >>> def kill_by_name(progname):
    ...     pidstr = os.popen('ps|grep '+progname+'|sort').read()
    ...     pid = int(pidstr.split()[0])
    ...     os.kill(pid, 9)
    ...
    >>> kill_by_name('myprog')

os.link(src, dst)
Create a hard link from path src to path dst on Unix-like systems. See the man page on the ln utility for more information.

SEE ALSO: os.symlink()

os.listdir(pathname)
Return a list of the names of files and directories at path pathname. The special entries for the current and parent directories (typically "." and "..") are excluded from the list.

os.lstat(pathname)
Information on file or directory pathname. See os.stat() for details. os.lstat() does not follow symbolic links.
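The difference between following and not following a symbolic link can be seen in a small sketch like the one below. It assumes a modern Python 3 interpreter on a platform that supports symbolic links; the file names target.txt and link.txt are hypothetical and live in a temporary directory created just for the example.

```python
import os
import stat
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "target.txt")
link = os.path.join(workdir, "link.txt")

with open(target, "w") as fp:
    fp.write("some text\n")
os.symlink(target, link)      # requires symbolic link support

followed = os.stat(link)      # describes the file the link points to
unfollowed = os.lstat(link)   # describes the link itself

is_regular = stat.S_ISREG(followed.st_mode)
is_link = stat.S_ISLNK(unfollowed.st_mode)
print(is_regular, is_link)

# Clean up the scratch files.
os.remove(link)
os.remove(target)
os.rmdir(workdir)
```

Reading the mode once and testing it with the stat.S_IS*() helpers is the same pattern the stat module section recommends when several predicates are needed.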

SEE ALSO: os.stat(); stat

os.mkdir(pathname [,mode=0777])
Create a directory named pathname with the numeric mode mode. On some operating systems, mode is ignored. See the man page for the chmod utility for more information on modes.

SEE ALSO: os.chmod(); os.makedirs()

os.makedirs(pathname [,mode=0777])
Create a directory named pathname with the numeric mode mode. Unlike os.mkdir(), this function will create any intermediate directories needed for a nested directory.

SEE ALSO: os.mkdir()

os.mkfifo(pathname [,mode=0666])
Create a named pipe on Unix-like systems.

os.nice(increment)
Decrease the process priority of the current application under Unix-like systems. This is useful if you do not wish for your application to hog system CPU resources.

The four functions in the os.popen*() family allow you to run external processes and capture their STDOUT and STDERR and/or set their STDIN. The members of the family differ somewhat in how these three pipes are handled.

os.popen(cmd [,mode="r" [,bufsize]])
Open a pipe to or from the external command cmd. The return value of the function is an open file object connected to the pipe. The mode may be r for read (the default) or w for write. The exit status of the command is returned when the file object is closed. An optional buffer size bufsize may be specified.

    >>> import os
    >>> def ls(pat):
    ...     stdout = os.popen('ls '+pat)
    ...     result = stdout.read()
    ...     status = stdout.close()
    ...     if status: print "Error status", status
    ...     else: print result
    ...
    >>> ls('nosuchfile')
    ls: nosuchfile: No such file or directory
    Error status 256
    >>> ls('chap[7-9].txt')
    chap7.txt

os.popen2(cmd [,mode [,bufsize]])
Open both STDIN and STDOUT pipes to the external command cmd. The return value is a pair of file objects connecting to the two respective pipes. mode and bufsize work as with os.popen().

SEE ALSO: os.popen3(); os.popen()

os.popen3(cmd [,mode [,bufsize]])
Open STDIN, STDOUT, and STDERR pipes to the external command cmd. The return value is a 3-tuple of file objects connecting to the three respective pipes.
mode and bufsize work as with os.popen().

    >>> import os
    >>> stdin, stdout, stderr = os.popen3('sed s/line/LINE/')
    >>> print >>stdin, 'line one'
    >>> print >>stdin, 'line two'
    >>> stdin.write('line three\n')
    >>> stdin.close()
    >>> stdout.read()
    'LINE one\nLINE two\nLINE three\n'
    >>> stderr.read()
    ''

os.popen4(cmd [,mode [,bufsize]])
Open STDIN, STDOUT, and STDERR pipes to the external command cmd. In contrast to os.popen3(), os.popen4() combines STDOUT and STDERR on the same pipe. The return value is a pair of file objects connecting to the two respective pipes. mode and bufsize work as with os.popen().

SEE ALSO: os.popen3(); os.popen()

os.putenv(var, value)
Set the environment variable var to the value value. Changes to the current environment only affect subprocesses of the current process, such as those launched with os.system() or os.popen(), not the whole OS. Calls to os.putenv() will update the environment, but not the os.environ variable. Therefore, it is better to update os.environ directly (which also changes the external environment).

SEE ALSO: os.environ; os.getenv(); os.popen(); os.system()

os.readlink(linkname)
Return a string containing the path that the symbolic link linkname points to. Works on Unix-like systems.

SEE ALSO: os.symlink()

os.remove(filename)
Remove the file named filename. This function is identical to os.unlink(). If the file cannot be removed, an OSError is raised.

SEE ALSO: os.unlink()

os.removedirs(pathname)
Remove the directory named pathname, then remove each parent directory named in pathname that has been left empty. This function will not remove directories that contain files, and will raise an OSError if the directory pathname itself cannot be removed.

SEE ALSO: os.rmdir()

os.rename(src, dst)
Rename the file or directory src as dst. Depending on the operating system, the operation may raise an OSError if dst already exists.

SEE ALSO: os.renames()

os.renames(src, dst)
Rename the file or directory src as dst.
Unlike os.rename(), this function will create any intermediate directories needed for a nested directory.

SEE ALSO: os.rename()

os.rmdir(pathname)
Remove the directory named pathname. This function will not remove nonempty directories and will raise an OSError if you attempt to do so.

SEE ALSO: os.removedirs()

os.startfile(path)
Launch an application under a Windows system. The behavior is the same as if path was double-clicked in a Drives window or as if you typed start followed by the path at a command line. Using Windows associations, a data file can be launched in the same manner as an actual executable application.

SEE ALSO: os.system()

os.stat(pathname)
Create a stat_result object that contains information on the file or directory pathname. A stat_result object has a number of attributes and also behaves like a tuple of numeric values. Before Python 2.2, only the tuple was provided. The attributes of a stat_result object are named the same as the constants in the stat module, but in lowercase.

    >>> import os, stat
    >>> file_info = os.stat('chap1.txt')
    >>> file_info[stat.ST_SIZE]
    87735L

On some platforms, additional attributes are available. For example, Unix-like systems usually have .st_blocks, .st_blksize, and .st_rdev attributes; MacOS has .st_rsize, .st_creator, and .st_type; RISCOS has .st_ftype, .st_attrs, and .st_obtype.

SEE ALSO: stat; os.lstat()

os.strerror(code)
Give a description for a numeric error code code, such as that returned by os.popen(bad_cmd).close().

SEE ALSO: os.popen()

os.symlink(src, dst)
Create a soft link from path src to path dst on Unix-like systems. See the man page on the ln utility for more information.

SEE ALSO: os.link(); os.readlink()

os.system(cmd)
Execute the command cmd in a subshell. Unlike execution using os.popen(), the output of the executed process is not captured (but it may still echo to the same terminal as the current Python application). In some cases, you can use os.system() on non-Windows systems to detach an application in a manner similar to os.startfile().
For example, under MacOSX, you could launch the TextEdit application with:

    >>> import os
    >>> cmd="/Applications/TextEdit.app/Contents/MacOS/TextEdit &"
    >>> os.system(cmd)
    0

SEE ALSO: os.popen(); os.startfile(); commands

os.tempnam([dir [,prefix]])
Return a unique filename for a temporary file. If the optional argument dir is specified, that directory will be used in the path; if prefix is specified, the file will have the indicated prefix. For most purposes, it is more secure to use os.tmpfile() to directly obtain a file object rather than first generating a name.

SEE ALSO: tempfile; os.tmpfile()

os.tmpfile()
Return an "invisible" file object in update mode. This file does not create a directory entry, but simply acts as a transient buffer for data on the filesystem.

SEE ALSO: tempfile; StringIO; cStringIO

os.uname()
Return detailed information about the current operating system on recent Unix-like systems. The returned 5-tuple contains sysname, nodename, release, version, and machine, each as descriptive strings.

os.unlink(filename)
Remove the file named filename. This function is identical to os.remove(). If the file cannot be removed, an OSError is raised.

SEE ALSO: os.remove()

os.utime(pathname, times)
Set the access and modification timestamps of file pathname to the tuple (atime, mtime) specified in times. Alternately, if times is None, set both timestamps to the current time.

SEE ALSO: time; os.chmod(); os.chown(); os.stat()

CONSTANTS AND ATTRIBUTES

os.altsep
Usually None, but an alternative path delimiter ("/") under Windows.

os.curdir
The string the operating system uses to refer to the current directory; for example, "." on Unix or ":" on Macintosh (before MacOSX).

os.defpath
The search path used by exec*p*() and spawn*p*() absent a PATH environment variable.

os.environ
A dictionary-like object containing the current environment.

    >>> os.environ['TERM']
    'vt100'
    >>> os.environ['TERM'] = 'vt220'
    >>> os.getenv('TERM')
    'vt220'

SEE ALSO: os.getenv(); os.putenv()

os.linesep
The string that delimits lines in a file; for example, "\n" on Unix, "\r" on Macintosh, "\r\n" on Windows.

os.name
A string identifying the operating system the current Python interpreter is running on. Possible strings include posix, nt, dos, mac, os2, ce, java, and riscos.

os.pardir
The string the operating system uses to refer to the parent directory; for example, ".." on Unix or "::" on Macintosh (before MacOSX).

os.pathsep
The string that delimits search paths; for example, ";" on Windows or ":" on Unix.

os.sep
The string the operating system uses to refer to path delimiters; for example, "/" on Unix, "\" on Windows, ":" on Macintosh.

SEE ALSO: sys; os.path

1.2.4 Special Data Values and Formats

random - Pseudo-random value generator

Python provides better pseudo-random number generation than do most C libraries with a rand() function, but not good enough for cryptographic purposes. The period of Python's Wichmann-Hill generator is about 7 trillion (7e12), but that merely indicates how long it will take a particular seeded generator to cycle; a different seed will produce a different sequence of numbers. Python 2.3 uses the superior Mersenne Twister generator, which has a longer period and has been better analyzed. For practical purposes, pseudo-random numbers generated by Python are more than adequate for random-seeming behavior in applications.
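The relationship between seeds and sequences can be illustrated with a short sketch. It assumes a modern Python 3 interpreter (which uses the Mersenne Twister rather than Wichmann-Hill), and the helper name first_values and the two seed values are hypothetical, introduced only for this example.

```python
import random

def first_values(seed, count=5):
    """Return the first `count` floats from a generator seeded with `seed`."""
    generator = random.Random(seed)   # an independent generator instance
    return [generator.random() for _ in range(count)]

same_a = first_values(12345)
same_b = first_values(12345)
other = first_values(54321)

print(same_a == same_b)   # same seed, same sequence
print(same_a == other)    # a different seed gives a different sequence
```

Using a separate random.Random instance, rather than reseeding the module-level generator, keeps the demonstration from disturbing any other code that draws random numbers.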

The underlying pseudo-random numbers generated by the random module can be mapped into a variety of nonuniform patterns and [...]

    >>> import random
    >>> t1,t2 = 0,0
    >>> for x in range(100):
    ...     t1 += random.expovariate(1./20)
    ...     t2 += random.expovariate(20.)
    ...
    >>> print t1/100, t2/100
    18.4021962198 0.0588234083388

random.gamma(alpha, beta)
Return a floating point value with a gamma distribution (not the gamma function).

random.gauss(mu, sigma)
Return a floating point value with a Gaussian distribution; the mean is mu and the sigma is sigma. random.gauss() is slightly faster than random.normalvariate().

random.lognormvariate(mu, sigma)
Return a floating point value with a log normal distribution; the natural logarithm of this distribution is Gaussian with mean mu and sigma sigma.

random.normalvariate(mu, sigma)
Return a floating point value with a Gaussian distribution; the mean is mu and the sigma is sigma.

random.paretovariate(alpha)
Return a floating point value with a Pareto distribution. alpha specifies the shape parameter.

random.random()
Return a floating point value in the range [0.0, 1.0).

random.randrange([start=0,] stop [,step])
Return a random element from the specified range. Functionally equivalent to the expression random.choice(range(start,stop,step)), but it does not build the actual range object. Use random.randrange() in place of the deprecated random.randint().

random.seed([x=time.time()])
Initialize the Wichmann-Hill generator. You do not necessarily need to call random.seed(), since the current system time is used to initialize the generator upon module import. But if you wish to provide more entropy in the initial state, you may pass any hashable object as argument x. Your best choice for x is a positive long integer less than 27814431486576L, whose value is selected at random by independent means.

random.shuffle(seq [,random=random.random])
Permute the mutable sequence seq in place. An optional argument random may be specified to use an alternate random generator, but it is unlikely you will want to use one.
Possible permutations get very big very quickly, so even for moderately sized sequences, not every permutation will occur.

random.uniform(min, max)
Return a random floating point value in the range [min, max).

random.vonmisesvariate(mu, kappa)
Return a floating point value with a von Mises distribution. mu is the mean angle expressed in radians, and kappa is the concentration parameter.

random.weibullvariate(alpha, beta)
Return a floating point value with a Weibull distribution. alpha is the scale parameter, and beta is the shape parameter.

struct • Create and read packed binary strings

The struct module allows you to encode Python numeric values compactly. This module may also be used to read C structs that use the same formats; some formatting codes are only useful for reading C structs. The exception struct.error is raised if a format does not match its string or values.

A format string consists of a sequence of alphabetic formatting codes. Each code is represented by zero or more bytes in the encoded packed binary string. Each formatting code may be preceded by a number indicating a number of occurrences. The entire format string may be preceded by a global flag. If the flag @ is used, platform-native data sizes and endianness are used. In all other cases, standard data sizes are used. The flag = explicitly indicates platform endianness; < indicates little-endian representations; > or ! indicates big-endian representations.

The available formatting codes are listed below. The standard sizes are given (check your platform for its sizes if platform-native sizes are needed).

Formatting codes for struct module

    Code  C type               Standard size
    x     pad byte             1 byte
    c     char                 1 byte
    b     signed char          1 byte
    B     unsigned char        1 byte
    h     short int            2 bytes
    H     unsigned short       2 bytes
    i     int                  4 bytes
    I     unsigned int         4 bytes
    l     long int             4 bytes
    L     unsigned long        4 bytes
    q     long long int        8 bytes
    Q     unsigned long long   8 bytes
    f     float                4 bytes
    d     double               8 bytes
    s     string               padded to size
    p     Pascal string        padded to size
    P     char pointer         native pointer size

Some usage examples clarify the encoding (the outputs shown come from a big-endian platform):

    >>> import struct
    >>> struct.pack('5s5p2c', 'sss','ppp','c','c')
    'sss\x00\x00\x03ppp\x00cc'
    >>> struct.pack('h', 1)
    '\x00\x01'
    >>> struct.pack('i', 1)
    '\x00\x00\x00\x01'
    >>> struct.pack('I', 1)
    '\x00\x00\x00\x01'
    >>> struct.pack('<l', 1)
    '\x01\x00\x00\x00'
    >>> struct.pack('f', 1)
    '?\x80\x00\x00'
    >>> struct.pack('hil', 1,2,3)
    '\x00\x01\x00\x00\x00\x00\x00\x02\x00\x00\x00\x03'

FUNCTIONS

struct.calcsize(fmt)
Return the length of the string that corresponds to the format fmt.

struct.pack(fmt, v1 [,v2 [...]])
Return a string with values v1, et alia, packed according to the format fmt.

struct.unpack(fmt, s)
Return a tuple of values represented by string s, packed according to the format fmt.

time • Functions to manipulate date/time values

The time module is useful both for computing and displaying dates and time increments, and for simple benchmarking of applications and functions. For some purposes, eGenix.com's mx.Date module is more useful for manipulating datetimes than is time; mx.Date is available from eGenix.com.

Time tuples, used by several functions, consist of year, month, day, hour, minute, second, weekday, Julian day, and Daylight Savings flag. All values are integers. Month, day, and Julian day (day of year) are one-based; hour, minute, second, and weekday are zero-based (Monday is 0). The Daylight Savings flag uses 1 for DST, 0 for Standard Time, and -1 for "best guess."

CONSTANTS AND ATTRIBUTES

time.accept2dyear
Boolean to allow two-digit years in date tuples. Default is a true value, in which case the first matching date since time.gmtime(0) is extrapolated.

    >>> import time
    >>> time.accept2dyear
    1
    >>> time.localtime(time.mktime((99,1,1,0,0,0,0,0,0)))
    (1999, 1, 1, 0, 0, 0, 4, 1, 0)
    >>> time.gmtime(0)
    (1970, 1, 1, 0, 0, 0, 3, 1, 0)

time.altzone
time.daylight
time.timezone
time.tzname
These several constants show information on the current timezone.
Different locations use Daylight Savings adjustments during different portions of the year, usually but not always a one-hour adjustment. time.daylight indicates only whether such an adjustment is available in time.altzone. time.timezone indicates how many seconds west of UTC the current zone is; time.altzone adjusts that for Daylight Savings if possible. time.tzname gives a tuple of strings describing the current zone.

    >>> time.daylight, time.tzname
    (1, ('EST', 'EDT'))
    >>> time.altzone, time.timezone
    (14400, 18000)

FUNCTIONS

time.asctime([tuple=time.localtime()])
Return a string description of a time tuple.

    >>> time.asctime((2002, 10, 25, 1, 51, 48, 4, 298, 1))
    'Fri Oct 25 01:51:48 2002'

SEE ALSO: time.ctime(); time.strftime()

time.clock()
Return the processor time for the current process. The raw value returned has little inherent meaning, but the value is guaranteed to increase roughly in proportion to the amount of CPU time used by the process. This makes time.clock() useful for comparative benchmarking of various operations or approaches. The values returned should not be compared between different CPUs, OSs, and so on, but are meaningful on one machine. For example:

    import time
    start1 = time.clock()
    approach_one()
    time1 = time.clock()-start1
    start2 = time.clock()
    approach_two()
    time2 = time.clock()-start2
    if time1 > time2:
        print "The second approach seems better"
    else:
        print "The first approach seems better"

Always use time.clock() for benchmarking rather than time.time(). The latter is a low-resolution "wall clock" only.

time.ctime([seconds=time.time()])
Return a string description of seconds since epoch.

    >>> time.ctime(1035526125)
    'Fri Oct 25 02:08:45 2002'

SEE ALSO: time.asctime()

time.gmtime([seconds=time.time()])
Return a time tuple of seconds since epoch, giving Greenwich Mean Time.

    >>> time.gmtime(1035526125)
    (2002, 10, 25, 6, 8, 45, 4, 298, 0)

SEE ALSO: time.localtime()

time.localtime([seconds=time.time()])
Return a time tuple of seconds since epoch, giving the local time.
    >>> time.localtime(1035526125)
    (2002, 10, 25, 2, 8, 45, 4, 298, 1)

SEE ALSO: time.gmtime(); time.mktime()

time.mktime(tuple)
Return a number of seconds since epoch corresponding to a time tuple.

    >>> time.mktime((2002, 10, 25, 2, 8, 45, 4, 298, 1))
    1035526125.0

SEE ALSO: time.localtime()

time.sleep(seconds)
Suspend execution for approximately seconds, measured in "wall clock" time (not CPU time). The argument seconds is a floating point value (precision subject to system timer) and is fully thread safe.

time.strftime(format [,tuple=time.localtime()])
Return a custom string description of a time tuple. The format given in the string format may contain the following fields: %a/%A/%w for abbreviated/full/decimal weekday name; %b/%B/%m for abbreviated/full/decimal month; %y/%Y for abbreviated/full year; %d for day-of-month; %H/%I for 24/12 clock hour; %j for day-of-year; %M for minute; %p for AM/PM; %S for seconds; %U/%W for week-of-year (Sunday/Monday start); %c/%x/%X for locale-appropriate datetime/date/time; %Z for timezone name. Other characters may occur in the format also and will appear as literals (a literal % is written as %%).

    >>> import time
    >>> tuple = (2002, 10, 25, 2, 8, 45, 4, 298, 1)
    >>> time.strftime("%A, %B %d '%y (week %U)", tuple)
    "Friday, October 25 '02 (week 42)"

SEE ALSO: time.asctime(); time.ctime(); time.strptime()

time.strptime(s [,format="%a %b %d %H:%M:%S %Y"])
Return a time tuple based on a string description of a time. The format given in the string format follows the same rules as in time.strftime(). Not available on most platforms.

SEE ALSO: time.strftime()

time.time()
Return the number of seconds since the epoch for the current time. You can specifically determine the epoch using time.ctime(0), but normally you will use other functions in the time module to generate useful values. Even though time.time() is also generally nondecreasing in its return values, you should use time.clock() for benchmarking purposes.
    >>> time.ctime(0)
    'Wed Dec 31 19:00:00 1969'
    >>> time.time()
    1035585490.484154
    >>> time.ctime(1035585437)
    'Fri Oct 25 18:37:17 2002'

SEE ALSO: time.clock(); time.ctime()

SEE ALSO: calendar

1.3 Other Modules in the Standard Library

If your application performs other types of tasks besides text processing, a skim of this module list can suggest where to look for relevant functionality. As well, readers who find themselves maintaining code written by other developers may find that unfamiliar modules are imported by the existing code. If an imported module is not summarized in the list below, nor documented elsewhere, it is probably an in-house or third-party module. For standard library modules, the summaries here will at least give you a sense of the general purpose of a given module.

__builtin__
Access to built-in functions, exceptions, and other objects. Python does a great job of exposing its own internals, but "normal" developers do not need to worry about this.

1.3.1 Serializing and Storing Python Objects

In object-oriented programming (OOP) languages like Python, compound data and structured data are frequently represented at runtime as native objects. At times these objects belong to basic datatypes (lists, tuples, and dictionaries), but more often, once you reach a certain degree of complexity, hierarchies of instances containing attributes become more likely.

For simple objects, especially sequences, serialization and storage is rather straightforward. For example, lists can easily be represented in delimited or fixed-length strings. Lists-of-lists can be saved in line-oriented files, each line containing delimited fields, or in rows of DBMS tables.
But once the dimension of nested sequences goes past two, and even more so for heterogeneous data structures, traditional table-oriented storage is a less-obvious fit.

While it is possible to create "object/relational adaptors" that write OOP instances to flat tables, that usually requires custom programming. A number of more general solutions exist, both in the Python standard library and in third-party tools. There are actually two separate issues involved in storing Python objects. The first issue is how to convert them into strings in the first place; the second issue is how to create a general persistence mechanism for such serialized objects. At a minimal level, of course, it is simple enough to store (and retrieve) a serialization string the same way you would any other string: to a file, a database, and so on. The various *dbm modules create a "dictionary on disk," while the shelve module automatically utilizes cPickle serialization to write arbitrary objects as values (keys are still strings).

Several third-party modules support object serialization with special features. If you need an XML dialect for your object representation, the modules gnosis.xml.pickle and xmlrpclib are useful. The YAML format is both human-readable/editable and has support libraries for Python, Perl, Ruby, and Java; using these various libraries, you can exchange objects between these several programming languages.

SEE ALSO: gnosis.xml.pickle; yaml; xmlrpclib

DBM • Interfaces to dbm-style databases

A dbm-style database is a "dictionary on disk." Using a database of this sort allows you to store a set of key/val pairs to a file, or files, on the local filesystem, and to access and set them as if they were an in-memory dictionary. A dbm-style database, unlike a standard dictionary, always maps strings to strings.
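As a rough illustration of the "dictionary on disk" idea, the sketch below uses current Python 3 (an assumption; this chapter describes the Python 2 modules), where the portable pure-Python backend is named dbm.dumb and stored keys and values come back as bytes rather than str. The file name example_db is made up.

```python
import dbm.dumb
import os
import tempfile

# Work in a throwaway directory so no real file is touched.
path = os.path.join(tempfile.mkdtemp(), "example_db")

db = dbm.dumb.open(path, "c")   # "c": create if needed, read/write access
db["language"] = "Python"       # str keys and values are encoded to bytes
db["version"] = str(3)          # non-string values must be converted first
db.close()

db = dbm.dumb.open(path, "r")   # reopen the same database read-only
stored = db["language"]         # values are returned as bytes
keys = sorted(db.keys())        # keys are returned as bytes too
db.close()

print(stored, keys)
```

The values survive closing and reopening the database, which is the whole point of using a DBM rather than an in-memory dictionary.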
If you need to store other types of objects, you will need to convert them to strings (or use the shelve module as a wrapper).

Depending on your platform, and on which external libraries are installed, different dbm modules might be available. The performance characteristics of the various modules vary significantly. As well, some DBM modules support some special functionality. Most of the time, however, your best approach is to access the locally supported DBM module using the wrapper module anydbm. Calls to this module will select the best available DBM for the current environment without a programmer or user having to worry about the underlying support mechanism.

Functions and methods are documented using the nonspecific capitalized form DBM. In real usage, you would use the name of a specific module. Most of the time, you will get or set DBM values using standard named indexing; for example, db["key"]. A few methods characteristic of dictionaries are also supported, as well as a few methods special to DBM databases.

SEE ALSO: shelve; dict; UserDict

FUNCTIONS

DBM.open(fname [,flag="r" [,mode=0666]])
Open the filename fname for dbm access. The optional argument flag specifies how the database is accessed. A value of r is for read-only access (on an existing dbm file); w opens an already existing file for read/write access; c will create a database or use an existing one, with read/write access; the option n will always create a new database, erasing the one named in fname if it already existed. The optional mode argument specifies the Unix mode of the file(s) created.

METHODS

DBM.close()
Close the database and flush any pending writes.

DBM.first()
Return the first key/val pair in the DBM. The order is arbitrary but stable. You may use the DBM.first() method, combined with repeated calls to DBM.next(), to process every item in the dictionary.
In Python 2.2+, you can implement an items() function to emulate the behavior of the .items() method of dictionaries for DBMs:

    >>> from __future__ import generators
    >>> def items(db):
    ...     try:
    ...         yield db.first()
    ...         while 1:
    ...             yield db.next()
    ...     except KeyError:
    ...         raise StopIteration
    ...
    >>> for k,v in items(d):   # typical usage
    ...     print k,v

DBM.has_key(key)
Return a true value if the DBM has the key key.

DBM.keys()
Return a list of string keys in the DBM.

DBM.last()
Return the last key/val pair in the DBM. The order is arbitrary but stable. You may use the DBM.last() method, combined with repeated calls to DBM.previous(), to process every item in the dictionary in reverse order.

DBM.next()
Return the next key/val pair in the DBM. A pointer to the current position is always maintained, so the methods DBM.next() and DBM.previous() can be used to access relative items.

DBM.previous()
Return the previous key/val pair in the DBM. A pointer to the current position is always maintained, so the methods DBM.next() and DBM.previous() can be used to access relative items.

DBM.sync()
Force any pending data to be written to disk.

SEE ALSO: FILE.flush()

MODULES

anydbm
Generic interface to underlying DBM support. Calls to this module use the functionality of the "best available" DBM module. If you open an existing database file, its type is guessed and used, assuming the current machine supports that style.

SEE ALSO: whichdb

bsddb
Interface to the Berkeley DB library.

dbhash
Interface to the BSD DB library.

dbm
Interface to the Unix (n)dbm library.

dumbdbm
Interface to slow, but portable pure Python DBM.

gdbm
Interface to the GNU DBM (GDBM) library.

whichdb
Guess which db package to use to open a db file. This module contains the single function whichdb.whichdb(). If you open an existing DBM file with anydbm, this function is called automatically behind the scenes.

SEE ALSO: shelve

cPickle • Fast Python object serialization
pickle • Standard Python object serialization

The module cPickle is a comparatively fast C implementation of the pure Python pickle module.
The streams produced and read by cPickle and pickle are interchangeable. The only time you should prefer pickle is in the uncommon case where you wish to subclass the pickling base class; cPickle is many times faster to use. The class pickle.Pickler is not documented here.

The cPickle and pickle modules support both a binary and an ASCII format. Neither is designed for human readability, but it is not hugely difficult to read an ASCII pickle. Nonetheless, if readability is a goal, yaml or gnosis.xml.pickle are better choices. Binary format produces smaller pickles that are faster to write or load.

It is possible to fine-tune the pickling behavior of objects by defining the methods .__getstate__(), .__setstate__(), and .__getinitargs__(). The particular black magic invocations involved in defining these methods, however, are not addressed in this book and are rarely necessary for "normal" objects (i.e., those that represent data structures).

Use of the cPickle or pickle module is quite simple:

    >>> import cPickle
    >>> from somewhere import my_complex_object
    >>> s = cPickle.dumps(my_complex_object)
    >>> new_obj = cPickle.loads(s)

FUNCTIONS

pickle.dump(o, file [,bin=0])
cPickle.dump(o, file [,bin=0])
Write a serialized form of the object o to the file-like object file. If the optional argument bin is given a true value, use binary format.

pickle.dumps(o [,bin=0])
cPickle.dumps(o [,bin=0])
Return a serialized form of the object o as a string. If the optional argument bin is given a true value, use binary format.

pickle.load(file)
cPickle.load(file)
Return an object that was serialized as the contents of the file-like object file.

pickle.loads(s)
cPickle.loads(s)
Return an object that was serialized in the string s.

SEE ALSO: gnosis.xml.pickle; yaml

marshal
Internal Python object serialization. marshal is a limited-purpose serialization to the pseudo-compiled byte-code format used by Python .pyc files. For more general object serialization, use pickle, cPickle, or gnosis.xml.pickle, or the YAML tools at http://yaml.org.

pprint • Pretty-print basic datatypes

The module pprint is similar to the built-in function repr() and the module repr. The purpose of pprint is to represent objects of basic datatypes in a more readable fashion, especially in cases where collection types nest inside each other. In simple cases pprint.pformat() and repr() produce the same result; for more complex objects, pprint uses newlines and indentation to illustrate the structure of a collection. Where possible, the string representation produced by pprint functions can be used to re-create objects with the built-in eval().

I find the module pprint somewhat limited in that it does not produce a particularly helpful representation of objects of custom types, which might themselves represent compound data. Instance attributes are very frequently used in a manner similar to dictionary keys. For example:

    >>> import pprint
    >>> dct = {1.7:2.5, ('t','u','p'):['l','i','s','t']}
    >>> dct2 = {'this':'that', 'num':38, 'dct':dct}
    >>> class Container: pass
    ...
    >>> inst = Container()
    >>> inst.this, inst.num, inst.dct = 'that', 38, dct
    >>> pprint.pprint(dct2)
    {'dct': {('t', 'u', 'p'): ['l', 'i', 's', 't'], 1.7: 2.5},
     'num': 38,
     'this': 'that'}
    >>> pprint.pprint(inst)
    <__main__.Container instance at 0x415770>

In the example, dct2 and inst have the same structure, and either might plausibly be chosen in an application as a data container. But the latter pprint representation only tells us the barest information about what an object is, not what data it contains. The mini-module below enhances pretty-printing:

pprint2.py

    from pprint import pformat
    import string, sys

    def pformat2(o):
        # Instances get one line per attribute; other objects use pformat()
        if hasattr(o,'__dict__'):
            lines = []
            klass = o.__class__.__name__
            module = o.__module__
            desc = '<%s.%s instance at 0x%x>' % (module, klass, id(o))
            lines.append(desc)
            for k,v in o.__dict__.items():
                lines.append('instance.%s=%s' % (k, pformat(v)))
            return string.join(lines,'\n')
        else:
            return pformat(o)

    def pprint2(o, stream=sys.stdout):
        stream.write(pformat2(o)+'\n')

Continuing the session above, we get a more useful report:

    >>> import pprint2
    >>> pprint2.pprint2(inst)
    <__main__.Container instance at 0x415770>
    instance.this='that'
    instance.dct={('t', 'u', 'p'): ['l', 'i', 's', 't'], 1.7: 2.5}
    instance.num=38

FUNCTIONS

pprint.isreadable(o)
Return a true value if the equality below holds:

    o == eval(pprint.pformat(o))

pprint.isrecursive(o)
Return a true value if the object o contains recursive containers. Objects that contain themselves at any nested level cannot be restored with eval().

pprint.pformat(o)
Return a formatted string representation of the object o.

pprint.pprint(o [,stream=sys.stdout])
Print the formatted representation of the object o to the file-like object stream.

CLASSES

pprint.PrettyPrinter(width=80, depth=None, indent=1, stream=sys.stdout)
Return a pretty-printing object that will format using a width of width, will limit recursion to depth depth, and will indent each new level by indent spaces. The method pprint.PrettyPrinter.pprint() will write to the file-like object stream.

    >>> pp = pprint.PrettyPrinter(width=30)
    >>> pp.pprint(dct2)
    {'dct': {1.7: 2.5,
             ('t', 'u', 'p'): ['l',
                               'i',
                               's',
                               't']},
     'num': 38,
     'this': 'that'}

METHODS

The class pprint.PrettyPrinter has the same methods as the module level functions. The only difference is that the stream used for pprint.PrettyPrinter.pprint() is configured when an instance is initialized rather than passed as an optional argument.

SEE ALSO: gnosis.xml.pickle; yaml

repr • Alternative object representation

The module repr contains code for customizing the string representation of objects. In its default behavior the function repr.repr() provides a length-limited string representation of objects. In the case of large collections, displaying the entire collection can be unwieldy, and unnecessary for merely distinguishing objects. For example:

    >>> dct = dict([(n,str(n)) for n in range(6)])
    >>> repr(dct)    # much worse for, e.g., 1000 item dict
    "{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"
    >>> from repr import repr
    >>> repr(dct)
    "{0: '0', 1: '1', 2: '2', 3: '3', ...}"
    >>> `dct`
    "{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"

The backtick operator does not change behavior if the built-in repr() function is replaced.

You can change the behavior of repr.repr() by modifying attributes of the instance object repr.aRepr.

    >>> dct = dict([(n,str(n)) for n in range(6)])
    >>> repr(dct)
    "{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"
    >>> import repr
    >>> repr.repr(dct)
    "{0: '0', 1: '1', 2: '2', 3: '3', ...}"

In my opinion, the choice of the name for this module is unfortunate, since it is identical to that of the built-in function. You can avoid some of the collision by using the as form of importing, as in:

    >>> import repr as _repr
    >>> from repr import repr as newrepr

For fine-tuned control of object representation, you may subclass the class repr.Repr. Potentially, you could use substitutable repr() functions to change the behavior of application output, but if you anticipate such a need, it is better practice to give a name that indicates this; for example, overridable_repr().

CLASSES

repr.Repr()
Base for customized object representations. The instance repr.aRepr automatically exists in the module namespace, so this class is useful primarily as a parent class. To change an attribute, it is simplest just to set it in an instance.

ATTRIBUTES

repr.maxlevel
Depth of recursive objects to follow.

repr.maxdict
repr.maxlist
repr.maxtuple
Number of items in a collection of the indicated type to include in the representation. Sequences default to 6, dicts to 4.

repr.maxlong
Number of digits of a long integer to stringify. Default is 40.

repr.maxstring
Length of string representation (e.g., s[:N]). Default is 30.

repr.maxother
"Catch-all" maximum length of other representation.

FUNCTIONS

repr.repr(o)
Behaves like built-in repr(), but potentially with a different string representation created.

repr.repr_TYPE(o, level)
Represent an object of the type TYPE, where the names used are the standard type names. The argument level indicates the level of recursion when this method is called (you might want to decide what to print based on how deep within the representation the object is).
The Python Library Reference gives the example:

    import repr, sys

    class MyRepr(repr.Repr):
        def repr_file(self, obj, level):
            if obj.name in ['<stdin>', '<stdout>', '<stderr>']:
                return obj.name
            else:
                return `obj`

    aRepr = MyRepr()
    print aRepr.repr(sys.stdin)          # prints '<stdin>'

shelve • General persistent dictionary

The module shelve builds on the capabilities of the DBM modules, but takes things a step forward. Unlike with the DBM modules, you may write arbitrary Python objects as values in a shelve database. The keys in shelve databases, however, must still be strings.

The methods of shelve databases are generally the same as those for their underlying DBMs. However, shelves do not have the .first(), .last(), .next(), or .previous() methods; nor do they have the .items() method that actual dictionaries do. Most of the time you will simply use name-indexed assignment and access. But from time to time, the available shelve.get(), shelve.keys(), shelve.sync(), shelve.has_key(), and shelve.close() methods are useful.

Usage of a shelve consists of a few simple steps like the ones below:

    >>> import shelve
    >>> sh = shelve.open('test_shelve')
    >>> sh.keys()
    ['this']
    >>> sh['new_key'] = {1:2, 3:4, ('t','u','p'):['l','i','s','t']}
    >>> sh.keys()
    ['this', 'new_key']
    >>> sh['new_key']
    {1: 2, 3: 4, ('t', 'u', 'p'): ['l', 'i', 's', 't']}
    >>> del sh['this']
    >>> sh.keys()
    ['new_key']
    >>> sh.close()

In the example, I opened an existing shelve, and the previously existing key/value pair was available. Deleting a key/value pair is the same as doing so from a standard dictionary. Opening a new shelve automatically creates the necessary file(s).

Although shelve only allows strings to be used as keys, in a pinch it is not difficult to generate strings that characterize other types of immutable objects. For the same reasons that you do not generally want to use mutable objects as dictionary keys, it is also a bad idea to use mutable objects as shelve keys.
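For readers on a current interpreter, the same steps can be sketched in Python 3 syntax (an assumption; the session above is Python 2). The shelve is created in a temporary directory, and the path name is made up for illustration.

```python
import os
import shelve
import tempfile

# Keep the example self-contained by writing to a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "test_shelve")

with shelve.open(path) as sh:    # opening a new shelve creates its file(s)
    sh["this"] = "that"
    sh["new_key"] = {1: 2, 3: 4, ("t", "u", "p"): ["l", "i", "s", "t"]}

with shelve.open(path) as sh:    # reopen: the earlier values are still there
    del sh["this"]               # deletion works as for a dictionary
    restored = sh["new_key"]
    keys = sorted(sh.keys())

print(keys)
print(restored)
```

The value stored under new_key is an arbitrary nested object, while both keys are plain strings, as the text requires.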
Using the built-in hash() function is a good way to generate strings, but keep in mind that this technique does not strictly guarantee uniqueness, so it is possible (but unlikely) to accidentally overwrite entries using this hack:

    >>> '%x' % hash((1,2,3,4,5))
    '95612314'
    >>> '%x' % hash(3.1415)
    'eaad0902'
    >>> '%x' % hash(38)
    '26'
    >>> '%x' % hash('38')
    '92bb58e3'

Integers, notice, are their own hash, and strings of digits are common. Therefore, if you adopted this approach, you would want to hash strings as well, before using them as keys. There is no real problem with doing so, merely an extra indirection step that you need to remember to use consistently:

    >>> sh['%x' % hash('another_key')] = 'another value'
    >>> sh.keys()
    ['new_key', '89ef0ca']
    >>> sh['%x' % hash('another_key')]
    'another value'
    >>> sh['another_key']
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/sw/lib/python2.2/shelve.py", line 70, in __getitem__
        f = StringIO(self.dict[key])
    KeyError: another_key

If you want to go beyond the capabilities of shelve in several ways, you might want to investigate the third-party library Zope Object Database (ZODB). ZODB allows arbitrary objects to be persistent, not only dictionary-like objects. Moreover, ZODB lets you store data in ways other than in local files, and also has adaptors for multiuser simultaneous access. Look for details at:

    http://www.zope.org/Wikis/ZODB/StandaloneZODB

SEE ALSO: DBM; dict

The rest of the listed modules are comparatively unlikely to be needed in text processing applications. Some modules are specific to a particular platform; if so, this is indicated parenthetically. Recent distributions of Python have taken a "batteries included" approach: much more is included in a base Python distribution than is with other free programming languages (but other popular languages still have a range of existing libraries that can be downloaded separately).

1.3.2 Platform-Specific Operations

_winreg
Access to the Windows registry (Windows).

AE
AppleEvents (Macintosh; replaced by Carbon.AE).

aepack
Conversion between Python variables and AppleEvent data containers (Macintosh).

aetypes
AppleEvent objects (Macintosh).

applesingle
Rudimentary decoder for AppleSingle format files (Macintosh).

buildtools
Build MacOS applets (Macintosh).

calendar
Print calendars, much like the Unix cal utility. A variety of functions allow you to print or stringify calendars for various time frames. For example,

    >>> import calendar
    >>> print calendar.month(2002,11)
       November 2002
    Mo Tu We Th Fr Sa Su
                 1  2  3
     4  5  6  7  8  9 10
    11 12 13 14 15 16 17
    18 19 20 21 22 23 24
    25 26 27 28 29 30

Carbon.AE, Carbon.App, Carbon.CF, Carbon.Cm, Carbon.Ctl, Carbon.Dlg, Carbon.Evt, Carbon.Fm, Carbon.Help, Carbon.List, Carbon.Menu, Carbon.Mlte, Carbon.Qd, Carbon.Qdoffs, Carbon.Qt, Carbon.Res, Carbon.Scrap, Carbon.Snd, Carbon.TE, Carbon.Win
Interfaces to Carbon API (Macintosh).

cd
CD-ROM access on SGI systems (IRIX).

cfmfile
Code Fragment Resource module (Macintosh).

ColorPicker
Interface to the standard color selection dialog (Macintosh).

ctb
Interface to the Communications Tool Box (Macintosh).

dl
Call C functions in shared objects (Unix).

EasyDialogs
Basic Macintosh dialogs (Macintosh).

fcntl
Access to Unix fcntl() and ioctl() system functions (Unix).

findertools
AppleEvents interface to MacOS finder (Macintosh).

fl, FL, flp
Functions and constants for working with the FORMS library (IRIX).

fm, FM
Functions and constants for working with the Font Manager library (IRIX).

fpectl
Floating point exception control (Unix).

FrameWork, MiniAEFrame
Structured development of MacOS applications (Macintosh).

gettext
The module gettext eases the development of multilingual applications. While actual translations must be performed manually, this module aids in identifying strings for translation and runtime substitutions of language-specific strings.

grp
Information on Unix groups (Unix).

locale
Control the language and regional settings for an application. The locale setting affects the behavior of several functions, such as time.strftime() and string.lower().
The locale module is also useful for creating strings such as numbers with grouped digits and currency strings for specific nations.

mac, macerrors, macpath
Macintosh implementation of os module functionality. It is generally better to use os directly and let it call mac where needed (Macintosh).

macfs, macfsn, macostools
Filesystem services (Macintosh).

MacOS
Access to MacOS Python interpreter (Macintosh).

macresource
Locate script resources (Macintosh).

macspeech
Interface to Speech Manager (Macintosh).

mactty
Easy access to serial line connections (Macintosh).

mkcwproject
Create CodeWarrior projects (Macintosh).

msvcrt
Miscellaneous Windows-specific functions provided in Microsoft's Visual C++ Runtime libraries (Windows).

Nav
Interface to Navigation Services (Macintosh).

nis
Access to Sun's NIS Yellow Pages (Unix).

pipes
Manage pipes at a finer level than done by os.popen() and its relatives. Reliability varies between platforms (Unix).

PixMapWrapper
Wrap PixMap objects (Macintosh).

posix, posixfile
Access to operating system functionality under Unix. The os module provides a more portable version of the same functionality and should be used instead (Unix).

preferences
Application preferences manager (Macintosh).

pty
Pseudo terminal utilities (IRIX, Linux).

pwd
Access to Unix password database (Unix).

pythonprefs
Preferences manager for Python (Macintosh).

py_resource
Helper to create PYC resources for compiled applications (Macintosh).

quietconsole
Buffered, nonvisible STDOUT output (Macintosh).

resource
Examine resource usage (Unix).

syslog
Interface to Unix syslog library (Unix).

tty, termios, TERMIOS
POSIX tty control (Unix).

W
Widgets for the Mac (Macintosh).

waste
Interface to the WorldScript-Aware Styled Text Engine (Macintosh).

winsound
Interface to audio hardware under Windows (Windows).

xdrlib
Implements (a subset of) Sun eXternal Data Representation (XDR). In concept, xdrlib is similar to the struct module, but the format is less widely used.
1.3.3 Working with Multimedia Formats

aifc
Read and write AIFC and AIFF audio files. The interface to aifc is the same as for the sunau and wave modules.

al, AL
Audio functions for SGI (IRIX).

audioop
Manipulate raw audio data.

chunk
Read chunks of IFF audio data.

colorsys
Convert between RGB color model and YIQ, HLS, and HSV color spaces.

gl, DEVICE, GL
Functions and constants for working with Silicon Graphics' Graphics Library (IRIX).

imageop
Manipulate image data stored as Python strings. For most operations on image files, the third-party Python Imaging Library (usually called "PIL"; see

[...]

    # ... >= 4 words will be sorted
    #   lexographically on the 4th, 5th, etc.. words.
    #   Any line with fewer than four words will be sorted to
    #   the end, and will occur in "natural" order.

    import sys, string, time
    wrerr = sys.stderr.write

    # naive custom sort
    def fourth_word(ln1,ln2):
        lst1 = string.split(ln1)
        lst2 = string.split(ln2)
        #-- Compare "long" lines
        if len(lst1) >= 4 and len(lst2) >= 4:
            return cmp(lst1[3:],lst2[3:])
        #-- Long lines before short lines
        elif len(lst1) >= 4 and len(lst2) < 4:
            return -1
        #-- Short lines after long lines
        elif len(lst1) < 4 and len(lst2) >= 4:
            return 1
        else:                   # Natural order
            return cmp(ln1,ln2)

    # Don't count the read itself in the time
    lines = open(sys.argv[1]).readlines()

    # Time the custom comparison sort
    start = time.time()
    lines.sort(fourth_word)
    end = time.time()
    wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
    # open('tmp.custom','w').writelines(lines)

    # Don't count the read itself in the time
    lines = open(sys.argv[1]).readlines()

    # Time the Schwartzian sort
    start = time.time()
    for n in range(len(lines)):       # Create the transform
        lst = string.split(lines[n])
        if len(lst) >= 4:             # Tuple w/ sort info first
            lines[n] = (lst[3:], lines[n])
        else:                         # Short lines to end
            lines[n] = (['\377'], lines[n])

    lines.sort()                      # Native sort

    for n in range(len(lines)):       # Restore original lines
        lines[n] = lines[n][1]

    end = time.time()
    wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
    # open('tmp.schwartzian','w').writelines(lines)

Only one particular example is presented, but readers should be able to generalize this technique to any sort they need to perform frequently or on large files.

2.1.2 Problem: Reformatting paragraphs of text

While I mourn the decline of plaintext ASCII as a communication format, and its eclipse by unnecessarily complicated and large (and often proprietary) formats, there is still plenty of life left in text files full of prose. READMEs, HOWTOs, email, Usenet posts, and this book itself are written in plaintext (or at least something close enough to plaintext that generic processing techniques are valuable). Moreover, many formats like HTML and LaTeX are frequently enough hand-edited that their plaintext appearance is important.

One task that is extremely common when working with prose text files is reformatting paragraphs to conform to desired margins. Python 2.3 adds the module textwrap, which performs more limited reformatting than the code below. Most of the time, this task gets done within text editors, which are indeed quite capable of performing the task. However, sometimes it would be nice to automate the formatting process. The task is simple enough that it is slightly surprising that Python has no standard module function to do this. There is the class formatter.DumbWriter, or the possibility of inheriting from and customizing formatter.AbstractWriter. These classes are discussed in Chapter 5; but frankly, the amount of customization and sophistication needed to use these classes and their many methods is way out of proportion for the task at hand.
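The textwrap module mentioned above can be tried in a few lines. This is only a sketch in current Python 3 syntax (an assumption; the surrounding text targets Python 2), and the sample text and margin values are made up for illustration. It wraps to a right margin with a fixed left indent; textwrap itself does not center or right-justify lines.

```python
import textwrap

text = ("One task that is extremely common when working with prose "
        "text files is reformatting paragraphs to conform to desired margins.")

# Wrap to a 40-column right margin, with a 4-space left margin on every line.
wrapped = textwrap.fill(text, width=40,
                        initial_indent="    ", subsequent_indent="    ")
print(wrapped)
```

The width argument counts the indent, so every output line, including its leading spaces, fits within 40 columns.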
Below is a simple solution that can be used either as a command-line tool (reading from STDIN and writing to STDOUT) or by import to a larger application.

reformat_para.py

    # Simple paragraph reformatter.  Allows specification
    # of left and right margins, and of justification style
    # (using constants defined in module).

    LEFT,RIGHT,CENTER = 'LEFT','RIGHT','CENTER'

    def reformat_para(para='',left=0,right=72,just=LEFT):
        words = para.split()
        lines = []
        line  = ''
        word = 0
        end_words = 0
        while not end_words:
            if len(words[word]) > right-left: # Handle very long words
                line = words[word]
                word += 1
                if word >= len(words):
                    end_words = 1
            else:                             # Compose line of words
                while len(line)+len(words[word]) <= right-left:
                    line += words[word]+' '
                    word += 1
                    if word >= len(words):
                        end_words = 1
                        break
            lines.append(line)
            line = ''
        if just==CENTER:
            r, l = right, left
            return '\n'.join([' '*left+ln.center(r-l) for ln in lines])
        elif just==RIGHT:
            return '\n'.join([line.rjust(right) for line in lines])
        else: # left justify
            return '\n'.join([' '*left+line for line in lines])

    if __name__=='__main__':
        import sys
        if len(sys.argv) <> 4:
            print "Please specify left_margin, right_marg, justification"
        else:
            left  = int(sys.argv[1])
            right = int(sys.argv[2])
            just  = sys.argv[3].upper()
            # Simplistic approach to finding initial paragraphs
            for p in sys.stdin.read().split('\n\n'):
                print reformat_para(p,left,right,just),'\n'

A number of enhancements are left to readers, if needed. You might want to allow hanging indents or indented first lines, for example. Or paragraphs meeting certain criteria might not be appropriate for wrapping (e.g., headers). A custom application might also determine the input paragraphs differently, either by a different parsing of an input file, or by generating paragraphs internally in some manner.

2.1.3 Problem: Column statistics for delimited or flat-record files

Data feeds, DBMS dumps, log files, and flat-file databases all tend to contain ontologically similar records, one per line, with a collection of fields in each record. Usually such fields are separated either by a specified delimiter or by specific column positions where fields are to occur.
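The two record layouts just described can be sketched as below. This is only an illustration in current Python 3 syntax: the helper names parse_delimited and parse_flat, the sample records, and the column positions are all hypothetical and do not come from the book's code.

```python
def parse_delimited(line, delimiter=","):
    """Split one record on a delimiter, dropping the line ending."""
    return line.rstrip("\n").split(delimiter)

def parse_flat(line, columns):
    """Slice one record at fixed (start, end) column positions."""
    return [line[start:end].strip() for start, end in columns]

# The same hypothetical record in both layouts.
delimited_record = "widget,12,3.50\n"
flat_record = "widget      12  3.50\n"
columns = [(0, 10), (10, 14), (14, 20)]   # assumed field positions

print(parse_delimited(delimited_record))
print(parse_flat(flat_record, columns))
```

Either way the result is a list of field strings, so any later computation on a field does not need to know which layout the record came from.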
Parsing these structured text records is quite easy, and performing computations on fields is equally straightforward. But in working with a variety of such "structured text databases," it is easy to keep writing almost the same code over again for each variation in format and computation.

The example below provides a generic framework for every similar computation on a structured text database.

fields_stats.py

    # Perform calculations on one or more of the
    # fields in a structured text database.

    import operator
    from types import *
    from xreadlines import xreadlines # req 2.1, but is much faster...
                                      # could use .readline() meth < 2.1
    #-- Symbolic Constants
    DELIMITED = 1
    FLATFILE = 2

    #-- Some sample "statistical" func (in functional programming style)
    nillFunc = lambda lst: None
    toFloat = lambda lst: map(float, lst)
    avg_lst = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
    sum_lst = lambda lst: reduce(operator.add, toFloat(lst))
    max_lst = lambda lst: reduce(max, toFloat(lst))

    class FieldStats:
        """Gather statistics about structured text database fields

        text_db may be either string (incl. Unicode) or file-like object
        style may be in (DELIMITED, FLATFILE)

    import sys, glob
    if len(sys.argv) > 1:
        c, w, l, p = 0, 0, 0, 0
        for pat in sys.argv[1:]:
            for file in glob.glob(pat):
                s = open(file).read()
                wc = len(s), len(s.split()), \
                     len(s.split('\n')), len(s.split('\n\n'))
                print '\t'.join(map(str, wc)),'\t'+file
                c, w, l, p = c+wc[0], w+wc[1], l+wc[2], p+wc[3]
        wc = (c,w,l,p)
        print '\t'.join(map(str, wc)), '\tTOTAL'
    else:
        s = sys.stdin.read()
        wc = len(s), len(s.split()), len(s.split('\n')), \
             len(s.split('\n\n'))
        print '\t'.join(map(str, wc)), '\tSTDIN'

This little functionality could be wrapped up in a function, but it is almost too compact to bother with doing so. Most of the work is in the interaction with the shell environment, with the counting basically taking only two lines.

The solution above is quite likely the "one obvious way to do it," and therefore Pythonic.
On the other hand, a slightly more adventurous reader might consider this assignment (if only for fun):

    >>> wc = map(len,[s]+map(s.split,(None,'\n','\n\n')))

A real daredevil might be able to reduce the entire program to a single print statement.

2.1.5 Problem: Transmitting binary data as ASCII

Many channels require that the information that travels over them is 7-bit ASCII. Any bytes with a high-order first bit of one will be handled unpredictably when transmitting data over protocols like Simple Mail Transfer Protocol (SMTP), Network News Transfer Protocol (NNTP), or HTTP (depending on content encoding), or even just when displaying them in many standard tools like editors. In order to encode 8-bit binary data as ASCII, a number of techniques have been invented over time.

An obvious, but obese, encoding technique is to translate each binary byte into its hexadecimal digits. UUencoding is an older standard that developed around the need to transmit binary files over the Usenet and on BBSs. Binhex is a similar technique from the MacOS world. In recent years, base64 (which is specified by RFC1521) has edged out the other styles of encoding. All of the techniques are basically 4/3 encodings (that is, four ASCII bytes are used to represent three binary bytes), but they differ somewhat in line ending and header conventions (as well as in the encoding as such). Quoted printable is yet another format, but of variable encoding length. In quoted printable encoding, most plain ASCII bytes are left unchanged, but a few special characters and all high-bit bytes are escaped.

Python provides modules for all the encoding styles mentioned. The high-level wrappers uu, binhex, base64, and quopri all operate on input and output file-like objects, encoding the data therein. They also each have slightly different method names and arguments. binhex, for example, closes its output file after encoding, which makes it unusable in conjunction with a cStringIO file-like object.
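The size trade-offs described above can be checked with a short sketch. It is written in current Python 3 syntax (where these functions take and return bytes) rather than the Python 2 of this chapter, and the sample data is an arbitrary assumption.

```python
import base64
import binascii
import quopri

# Twelve arbitrary high-bit bytes (a multiple of three, so no padding).
data = bytes(range(250, 256)) * 2

hexed = binascii.hexlify(data)     # two ASCII bytes per input byte
b64 = base64.b64encode(data)       # four ASCII bytes per three input bytes

# Quoted printable leaves plain ASCII alone and escapes the high-bit byte.
text = b"caf\xe9 au lait"
qp = quopri.encodestring(text)

print(len(data), len(hexed), len(b64))
print(qp)
```

For mostly-ASCII input, quoted printable adds very little overhead; for arbitrary binary input, base64's fixed 4/3 expansion is usually the better choice.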
All of the high-level encoders utilize the services of the low-level C module binascii. binascii, in turn, implements the actual low-level block conversions, but assumes that it will be passed the right size blocks for a given encoding.

The standard library, therefore, does not contain quite the right intermediate-level functionality for when the goal is just encoding the binary data in arbitrary strings. It is easy to wrap that up, though:

encode_binary.py

    # Provide encoders for arbitrary binary data
    # in Python strings.  Handles block size issues
    # transparently, and returns a string.
    # Precompression of the input string can reduce
    # or eliminate any size penalty for encoding.

    import sys
    import zlib
    import binascii

    UU = 45
    BASE64 = 57
    BINHEX = sys.maxint

    def ASCIIencode(s='', type=BASE64, compress=1):
        """ASCII encode a binary string"""
        # First, decide the encoding style
        if type == BASE64:   encode = binascii.b2a_base64
        elif type == UU:     encode = binascii.b2a_uu
        elif type == BINHEX: encode = binascii.b2a_hqx
        else: raise ValueError, "Encoding must be in UU, BASE64, BINHEX"
        # Second, compress the source if specified
        if compress: s = zlib.compress(s)
        # Third, encode the string, block-by-block
        offset = 0
        blocks = []
        while 1:
            blocks.append(encode(s[offset:offset+type]))
            offset += type
            if offset > len(s):
                break
        # Fourth, return the concatenated blocks
        return ''.join(blocks)

    def ASCIIdecode(s='', type=BASE64, compress=1):
        """Decode ASCII to a binary string"""
        # First, decide the encoding style
        if type == BASE64:   s = binascii.a2b_base64(s)
        elif type == BINHEX: s = binascii.a2b_hqx(s)
        elif type == UU:
            s = ''.join([binascii.a2b_uu(line) for line in s.split('\n')])
        # Second, decompress the source if specified
        if compress: s = zlib.decompress(s)
        # Third, return the decoded binary string
        return s

    # Encode/decode STDIN for self-test
    if __name__ == '__main__':
        decode, TYPE = 0, BASE64
        for arg in sys.argv:
            if   arg.lower()=='-d': decode = 1
            elif arg.upper()=='UU': TYPE=UU
            elif arg.upper()=='BINHEX': TYPE=BINHEX
            elif arg.upper()=='BASE64': TYPE=BASE64
        if decode:
            print ASCIIdecode(sys.stdin.read(),type=TYPE)
        else:
            print ASCIIencode(sys.stdin.read(),type=TYPE)

The example above does not attach any headers or delimit the encoded block (by design); for that, a wrapper like uu, mimify, or MimeWriter is a better choice. Or a custom wrapper around encode_binary.py.

2.1.6 Problem: Creating word or letter histograms

A histogram is an analysis of the relative occurrence frequency of each of a number of possible values. In terms of text processing, the occurrences in question are almost always either words or byte values. Creating histograms is quite simple using Python dictionaries, but the technique is not always immediately obvious to people thinking about it. The example below has a good generality, provides several utility functions associated with histograms, and can be used in a command-line operation mode.

histogram.py

    # Create occurrence counts of words or characters
    # A few utility functions for presenting results
    # Avoids requirement of recent Python features

    from string import split, maketrans, translate, punctuation, digits
    import sys
    from types import *
    import types

    def word_histogram(source):
        """Create histogram of normalized words (no punct or digits)"""
        hist = {}
        trans = maketrans('','')
        if type(source) in (StringType,UnicodeType):  # String-like src
            for word in split(source):
                word = translate(word, trans, punctuation+digits)
                if len(word) > 0:
                    hist[word] = hist.get(word,0) + 1
        elif hasattr(source,'read'):                  # File-like src
            try:
                from xreadlines import xreadlines     # Check for module
                for line in xreadlines(source):
                    for word in split(line):
                        word = translate(word, trans, punctuation+digits)
                        if len(word) > 0:
                            hist[word] = hist.get(word,0) + 1
            except ImportError:                       # Older Python ver
                line = source.readline()          # Slow but mem-friendly
                while line:
                    for word in split(line):
                        word = translate(word, trans, punctuation+digits)
                        if len(word) > 0:
                            hist[word] = hist.get(word,0) + 1
                    line = source.readline()
        else:
            raise TypeError, \
                  "source must be a string-like or file-like object"
        return hist

    def char_histogram(source, sizehint=1024*1024):
        hist = {}
        if type(source) in (StringType,UnicodeType):  # String-like src
            for char in source:
                hist[char] = hist.get(char,0) + 1
        elif hasattr(source,'read'):                  # File-like src
            chunk = source.read(sizehint)
            while chunk:
                for char in chunk:
                    hist[char] = hist.get(char,0) + 1
                chunk = source.read(sizehint)
        else:
            raise TypeError, \
                  "source must be a string-like or file-like object"
        return hist

    def most_common(hist, num=1):
        pairs = []
        for pair in hist.items():
            pairs.append((pair[1],pair[0]))
        pairs.sort()
        pairs.reverse()
        return pairs[:num]

    def first_things(hist, num=1):
        pairs = []
        things = hist.keys()
        things.sort()
        for thing in things:
            pairs.append((thing,hist[thing]))
        pairs.sort()
        return pairs[:num]

    if __name__ == '__main__':
        if len(sys.argv) > 1:
            hist = word_histogram(open(sys.argv[1]))
        else:
            hist = word_histogram(sys.stdin)
        print "Ten most common words:"
        for pair in most_common(hist, 10):
            print '\t', pair[1], pair[0]

        print "First ten words alphabetically:"
        for pair in first_things(hist, 10):
            print '\t', pair[0], pair[1]

        # a more practical command-line version might use:
        # for pair in most_common(hist,len(hist)):
        #     print pair[1],'\t',pair[0]

Several of the design choices are somewhat arbitrary. Words have all their punctuation stripped to identify "real" words. But on the other hand, words are still case-sensitive, which may not be what is desired. The sorting functions first_things() and most_common() only return an initial sublist.
Perhaps it would be better to return the whole list, and let the user slice the result. It is simple to customize around these sorts of issues, though.

2.1.7 Problem: Reading a file backwards by record, line, or paragraph

Reading a file line by line is a common task in Python, or in most any language. Files like server logs, configuration files, structured text databases, and others frequently arrange information into logical records, one per line. Very often, the job of a program is to perform some calculation on each record in turn.

Python provides a number of convenient methods on file-like objects for such line-by-line reading. FILE.readlines() reads a whole file at once and returns a list of lines. The technique is very fast, but requires the whole contents of the file be kept in memory. For very large files, this can be a problem. FILE.readline() is memory-friendly (it just reads a line at a time and can be called repeatedly until the EOF is reached), but it is also much slower. The best solution for recent Python versions is xreadlines.xreadlines() or FILE.xreadlines() in Python 2.1+. These techniques are memory-friendly, while still being fast and presenting a "virtual list" of lines (by way of Python's new generator/iterator interface).

The above techniques work nicely for reading a file in its natural order, but what if you want to start at the end of a file and work backwards from there? This need is frequently encountered when you want to read log files that have records appended over time (and when you want to look at the most recent records first). It comes up in other situations also. There is a very easy technique if memory usage is not an issue:

    >>> open('lines','w').write('\n'.join([`n` for n in range(100)]))
    >>> fp = open('lines')
    >>> lines = fp.readlines()
    >>> lines.reverse()
    >>> for line in lines[1:5]:
    ...     # Processing suite here
    ...     print line,
    ...
    98
    97
    96
    95

For large input files, however, this technique is not feasible. It would be nice to have something analogous to xreadlines here.
The example below provides a good starting point (the example works equally well for file-like objects).

read_backwards.py

    # Read blocks of a file from end to beginning.
    # Blocks may be defined by any delimiter, but the
    #  constants LINE and PARA are useful ones.
    # Works much like the file object method '.readline()':
    #  repeated calls continue to get "next" part, and
    #  function returns empty string once BOF is reached.

    # Define constants
    from os import linesep
    LINE = linesep
    PARA = linesep*2
    READSIZE = 1000

    # Global variables
    buffer = ''

    def read_backwards(fp, mode=LINE, sizehint=READSIZE, _init=[0]):
        """Read blocks of file backwards (return empty string when done)"""
        # Trick of mutable default argument to hold state between calls
        if not _init[0]:
            fp.seek(0,2)
            _init[0] = 1
        # Find a block (using global buffer)
        global buffer
        while 1:
            # first check for block in buffer
            delim = buffer.rfind(mode)
            if delim <> -1:     # block is in buffer, return it
                block = buffer[delim+len(mode):]
                buffer = buffer[:delim]
                return block+mode
            #-- BOF reached, return remainder (or empty string)
            elif fp.tell()==0:
                block = buffer
                buffer = ''
                return block
            else:               # Read some more data into the buffer
                readsize = min(fp.tell(),sizehint)
                fp.seek(-readsize,1)
                buffer = fp.read(readsize) + buffer
                fp.seek(-readsize,1)

    #-- Self test of read_backwards()
    if __name__ == '__main__':
        # Let's create a test file to read in backwards
        fp = open('lines','wb')
        fp.write(LINE.join(['--- %d ---'%n for n in range(15)]))
        # Now open for reading backwards
        fp = open('lines','rb')
        # Read the blocks in, one per call (block==line by default)
        block = read_backwards(fp)
        while block:
            print block,
            block = read_backwards(fp)

Notice that anything could serve as a block delimiter. The constants provided just happened to work for lines and block paragraphs (and block paragraphs only with the current OS's style of line breaks). But other delimiters could be used. It would not be immediately possible to read backwards word-by-word: a space delimiter would come close, but would not be quite right for other whitespace.
However, reading a line (and maybe reversing its words) is generally good enough.

Another enhancement is possible with Python 2.2+. Using the new yield keyword, read_backwards() could be programmed as an iterator rather than as a multi-call function. The performance will not differ significantly, but the function might be expressed more clearly (and a "list-like" interface like FILE.readlines() makes the application's loop simpler).

QUESTIONS

1. Write a generator-based version of read_backwards() that uses the yield keyword. Modify the self-test code to utilize the generator instead.

2. Explore and explain some pitfalls with the use of a mutable default value as a function argument. Explain also how the style allows functions to encapsulate data and contrast with the encapsulation of class instances.

2.2 Standard Modules

2.2.1 Basic String Transformations

The module string forms the core of Python's text manipulation libraries. That module is certainly the place to look before other modules. Most of the methods in the string module, you should note, have been copied to methods of string objects from Python 1.6+. Moreover, methods of string objects are a little bit faster to use than are the corresponding module functions. A few new methods of string objects do not have equivalents in the string module, but are still documented here.

SEE ALSO: str; UserString;

string: A collection of string operations

There are a number of general things to notice about the functions in the string module (which is composed entirely of functions and constants; no classes).

1. Strings are immutable (as discussed in Chapter 1). This means that there is no such thing as changing a string "in place" (as we might do in many other languages, such as C, by changing the bytes at certain offsets within the string). Whenever a string module function takes a string object as an argument, it returns a brand-new string object and leaves the original one as is.
However, the very common pattern of binding the same name on the left of an assignment as was passed on the right side within the string module function somewhat conceals this fact. For example:

    >>> import string
    >>> str = "Mary had a little lamb"
    >>> str = string.replace(str, 'had', 'ate')
    >>> str
    'Mary ate a little lamb'

The first string object never gets modified per se; but since the first string object is no longer bound to any name after the example runs, the object is subject to garbage collection and will disappear from memory. In short, calling a string module function will not change any existing strings, but rebinding a name can make it look like they changed.

2. Many string module functions are now also available as string object methods. To use these string object methods, there is no need to import the string module, and the expression is usually slightly more concise. Moreover, using a string object method is usually slightly faster than the corresponding string module function. However, the most thorough documentation of each function/method that exists as both a string module function and a string object method is contained in this reference to the string module.

3. The form string.join(string.split(...)) is a frequent Python idiom. A more thorough discussion is contained in the reference items for string.join() and string.split(), but in general, combining these two functions is very often a useful way of breaking down a text, processing the parts, then putting together the pieces.

4. Think about clever string.replace() patterns. By combining multiple string.replace() calls with use of "place holder" string patterns, a surprising range of results can be achieved (especially when also manipulating the intermediate strings with other techniques). See the reference item for string.replace() for some discussion and examples.

5. A mutable string of sorts can be obtained by using built-in lists, or the array module.
Lists can contain a collection of substrings, each one of which may be replaced or modified individually. The array module can define arrays of individual characters, each position modifiable, included with slice notation. The function string.join() or the method "".join() may be used to re-create true strings; for example:

    >>> lst = ['spam','and','eggs']
    >>> lst[2] = 'toast'
    >>> print ''.join(lst)
    spamandtoast
    >>> print ' '.join(lst)
    spam and toast

Or:

    >>> import array
    >>> a = array.array('c','spam and eggs')
    >>> a[0] = 'S'
    >>> print ''.join(a)
    Spam and eggs
    >>> a[-4:] = array.array('c','toast')
    >>> print ''.join(a)
    Spam and toast

CONSTANTS

The string module contains constants for a number of frequently used collections of characters. Each of these constants is itself simply a string (rather than a list, tuple, or other collection). As such, it is easy to define constants alongside those provided by the string module, should you need them. For example:

    >>> import string
    >>> string.brackets = "[]{}()<>"
    >>> print string.brackets
    []{}()<>

string.digits
The decimal numerals ("0123456789").

string.hexdigits
The hexadecimal numerals ("0123456789abcdefABCDEF").

string.octdigits
The octal numerals ("01234567").

string.lowercase
The lowercase letters; can vary by language. In English versions of Python (most systems):

    >>> import string
    >>> string.lowercase
    'abcdefghijklmnopqrstuvwxyz'

You should not modify string.lowercase for a source text language, but rather define a new attribute, such as string.spanish_lowercase with an appropriate string (some methods depend on this constant).

string.uppercase
The uppercase letters; can vary by language. In English versions of Python (most systems):

    >>> import string
    >>> string.uppercase
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

You should not modify string.uppercase for a source text language, but rather define a new attribute, such as string.spanish_uppercase with an appropriate string (some methods depend on this constant).

string.letters
All the letters (string.lowercase+string.uppercase).

string.punctuation
The characters normally considered as punctuation; can vary by language.
In English versions of Python (most systems):

    >>> import string
    >>> string.punctuation
    '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

string.whitespace
The "empty" characters. Normally these consist of tab, linefeed, vertical tab, formfeed, carriage return, and space (in that order):

    >>> import string
    >>> string.whitespace
    '\011\012\013\014\015 '

You should not modify string.whitespace (some methods depend on this constant).

string.printable
All the characters that can be printed to any device; can vary by language (string.digits+string.letters+string.punctuation+string.whitespace).

FUNCTIONS

string.atof(s=...)
Deprecated. Use float(). Converts a string to a floating point value.

SEE ALSO: eval(); float();

string.atoi(s=...[,base=10])
Deprecated with Python 2.0. Use int() if no custom base is needed or if using Python 2.0+. Converts a string to an integer value (if the string should be assumed to be in a base other than 10, the base may be specified as the second argument).

SEE ALSO: eval(); int(); long();

string.atol(s=...[,base=10])
Deprecated with Python 2.0. Use long() if no custom base is needed or if using Python 2.0+. Converts a string to an unlimited length integer value (if the string should be assumed to be in a base other than 10, the base may be specified as the second argument).

SEE ALSO: eval(); long(); int();

string.capitalize(s=...)
"".capitalize()
Return a string consisting of the initial character converted to uppercase (if applicable), and all other characters converted to lowercase (if applicable):

    >>> import string
    >>> string.capitalize("mary had a little lamb")
    'Mary had a little lamb'
    >>> string.capitalize("Mary had a Little Lamb")
    'Mary had a little lamb'
    >>> string.capitalize("2 Lambs had Mary!")
    '2 lambs had mary!'

For Python 1.6+, use of a string object method is marginally faster and is stylistically preferred in most cases:

    >>> "mary had a little lamb".capitalize()
    'Mary had a little lamb'

SEE ALSO: string.capwords(); string.lower();

string.capwords(s=...)
"".title()
Return a string consisting of the capitalized words.
An equivalent expression is:

    string.join(map(string.capitalize,string.split(s)))

But string.capwords() is a clearer way of writing it. An effect of this implementation is that whitespace is "normalized" by the process:

    >>> import string
    >>> string.capwords("mary HAD a little lamb")
    'Mary Had A Little Lamb'
    >>> string.capwords("Mary     had a      Little Lamb")
    'Mary Had A Little Lamb'

With the creation of string methods in Python 1.6, the module function string.capwords() was renamed as a string method to "".title().

SEE ALSO: string.capitalize(); string.lower(); "".istitle();

string.center(s=..., width=...)
"".center(width)
Return a string with s padded with symmetrical leading and trailing spaces (but not truncated) to occupy length width (or more).

    >>> import string
    >>> string.center(width=30,s="Mary had a little lamb")
    '    Mary had a little lamb    '
    >>> string.center("Mary had a little lamb", 5)
    'Mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> "Mary had a little lamb".center(25)
    '  Mary had a little lamb '

SEE ALSO: string.ljust(); string.rjust();

string.count(s, sub [,start [,end]])
"".count(sub [,start [,end]])
Return the number of nonoverlapping occurrences of sub in s. If the optional third or fourth arguments are specified, only the corresponding slice of s is examined.

    >>> import string
    >>> string.count("mary had a little lamb", "a")
    4
    >>> string.count("mary had a little lamb", "a", 3, 10)
    2

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary had a little lamb'.count("a")
    4

"".endswith(suffix [,start [,end]])
This string method does not have an equivalent in the string module. Return a Boolean value indicating whether the string ends with the suffix suffix. If the optional second argument start is specified, only consider the terminal substring after offset start.
If the optional third argument end is given, only consider the slice [start:end].

SEE ALSO: "".startswith(); string.find();

string.expandtabs(s=...[,tabsize=8])
"".expandtabs([,tabsize=8])
Return a string with tabs replaced by a variable number of spaces. The replacement causes text blocks to line up at "tab stops." If no second argument is given, the new string will line up at multiples of 8 spaces. A newline implies a new set of tab stops.

    >>> import string
    >>> s = 'mary\011had a little lamb'
    >>> print s
    mary    had a little lamb
    >>> string.expandtabs(s, 16)
    'mary            had a little lamb'
    >>> string.expandtabs(tabsize=1, s=s)
    'mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary\011had a little lamb'.expandtabs(25)
    'mary                     had a little lamb'

string.find(s, sub [,start [,end]])
"".find(sub [,start [,end]])
Return the index position of the first occurrence of sub in s. If the optional third or fourth arguments are specified, only the corresponding slice of s is examined (but the result is the position in s as a whole). Return -1 if no occurrence is found. Position is zero-based, as with Python list indexing:

    >>> import string
    >>> string.find("mary had a little lamb", "a")
    1
    >>> string.find("mary had a little lamb", "a", 3, 10)
    6
    >>> string.find("mary had a little lamb", "b")
    21
    >>> string.find("mary had a little lamb", "b", 3, 10)
    -1

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary had a little lamb'.find("ad")
    6

SEE ALSO: string.index(); string.rfind();

string.index(s, sub [,start [,end]])
"".index(sub [,start [,end]])
Return the same value as does string.find() with same arguments, except raise ValueError instead of returning -1 when sub does not occur.

    >>> import string
    >>> string.index("mary had a little lamb", "b")
    21
    >>> string.index("mary had a little lamb", "b", 3, 10)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "d:/py20sl/lib/string.py", line 139, in index
        return s.index(*args)
    ValueError: substring not found in string.index

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary had a little lamb'.index("ad")
    6

SEE ALSO: string.find(); string.rindex();

Several string methods return Boolean values indicating whether a string has a certain property. None of the .is*() methods, however, have equivalents in the string module:

"".isalpha()
Return a true value if all the characters are alphabetic.

"".isalnum()
Return a true value if all the characters are alphanumeric.

"".isdigit()
Return a true value if all the characters are digits.

"".islower()
Return a true value if all the characters are lowercase and there is at least one cased character:

    >>> "ab123".islower(), '123'.islower(), 'Ab123'.islower()
    (1, 0, 0)

SEE ALSO: "".lower();

"".isspace()
Return a true value if all the characters are whitespace.

"".istitle()
Return a true value if the string has title casing (each word capitalized).

SEE ALSO: "".title();

"".isupper()
Return a true value if all the characters are uppercase and there is at least one cased character.

SEE ALSO: "".upper();

string.join(words=...[,sep=" "])
"".join(words)
Return a string that results from concatenating the elements of the list words together, with sep between each. The function string.join() differs from all other string module functions in that it takes a list (of strings) as a primary argument, rather than a string.

It is worth noting string.join() and string.split() are inverse functions if sep is specified to both; in other words, string.join(string.split(s,sep),sep)==s for all s and sep.

Typically, string.join() is used in contexts where it is natural to generate lists of strings. For example, here is a small program to output the list of all-capital words from STDIN to STDOUT, one per line:

list_capwords.py

    import string,sys
    capwords = []

    for line in sys.stdin.readlines():
        for word in line.split():
            if word == word.upper() and word.isalpha():
                capwords.append(word)
    print string.join(capwords, '\n')

The technique in the sample list_capwords.py script can be considerably more efficient than building up a string by direct concatenation. However, Python 2.0's augmented assignment reduces the performance difference:

    >>> import string
    >>> s = "Mary had a little lamb"
    >>> t = "its fleece was white as snow"
    >>> s = s +" "+ t   # relatively "expensive" for big strings
    >>> s += " " + t    # "cheaper" than Python 1.x style
    >>> lst = [s]
    >>> lst.append(t)   # "cheapest" way of building long string
    >>> s = string.join(lst)

For Python 1.6+, use of a string object method is stylistically preferred in some cases. However, just as string.join() is special in taking a list as a first argument, the string object method "".join() is unusual in being an operation on the (optional) sep string, not on the (required) words list (this surprises many new Python programmers).

SEE ALSO: string.split();

string.joinfields(...)
Identical to string.join().

string.ljust(s=..., width=...)
"".ljust(width)
Return a string with s padded with trailing spaces (but not truncated) to occupy length width (or more).

    >>> import string
    >>> string.ljust(width=30,s="Mary had a little lamb")
    'Mary had a little lamb        '
    >>> string.ljust("Mary had a little lamb", 5)
    'Mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> "Mary had a little lamb".ljust(25)
    'Mary had a little lamb   '

SEE ALSO: string.rjust(); string.center();

string.lower(s=...)
"".lower()
Return a string with any uppercase letters converted to lowercase.

    >>> import string
    >>> string.lower("mary HAD a little lamb")
    'mary had a little lamb'
    >>> string.lower("Mary had a Little Lamb")
    'mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> "Mary had a Little Lamb".lower()
    'mary had a little lamb'

SEE ALSO: string.upper();

string.lstrip(s=...)
"".lstrip([chars=string.whitespace])
Return a string with leading whitespace characters removed.
For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> import string
    >>> s = """
    ...     Mary had a little lamb      \011"""
    >>> string.lstrip(s)
    'Mary had a little lamb      \011'
    >>> s.lstrip()
    'Mary had a little lamb      \011'

Python 2.3+ accepts the optional argument chars to the string object method. All characters in the string chars will be removed.

SEE ALSO: string.rstrip(); string.strip();

string.maketrans(from, to)
Return a translation table string for use with string.translate(). The strings from and to must be the same length. A translation table is a string of 256 successive byte values, where each position defines a translation from the chr() value of the index to the character contained at that index position.

    >>> import string
    >>> ord('A')
    65
    >>> ord('z')
    122
    >>> string.maketrans('ABC','abc')[65:123]
    'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz'
    >>> string.maketrans('ABCxyz','abcXYZ')[65:123]
    'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwXYZ'

SEE ALSO: string.translate();

string.replace(s=..., old=..., new=...[,maxsplit=...])
"".replace(old, new [,maxsplit])
Return a string based on s with occurrences of old replaced by new. If the fourth argument maxsplit is specified, only replace maxsplit initial occurrences.

    >>> import string
    >>> string.replace("Mary had a little lamb", "a little", "some")
    'Mary had some lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> "Mary had a little lamb".replace("a little", "some")
    'Mary had some lamb'

A common "trick" involving string.replace() is to use it multiple times to achieve a goal. Obviously, simply to replace several different substrings in a string, multiple string.replace() operations are almost inevitable. But there is another class of cases where string.replace() can be used to create an intermediate string with "placeholders" for an original substring in a particular context.
The same goal can always be achieved with regular expressions, but sometimes staged string replace) operations are both faster and easier to program: >>> import sting >>> line = variable = val # see comments #3 and #4 >> # we'd lke '89' and #4’ spelled out within comment >>> stiing.replace(tine,'#number’) _ # doesnt work Variable = val umber see comments number 3 and number 4 >> place_holder=string replace(ine,’#" Il" # int picholder >>> place_holder Variable = val II see comments #3 and #4° >> place_holder=place_olderreplace(#number ) # almost there >>> place_holder Variable = val I! see comments number 3 and number 4 >> line = string replace{place holder, #)# restore orig pooling ‘Variable = val # see comments number 3 and number 4° Obviously, for jobs tke this, a placeholder must be chosen so as not ever to occur within the stings undergoing "staged transformation” but that shouldbe possible generally since placeholders may be as long as needed, SEE ALSO: string ranslate() 145;mx TestToolsreplace() 314 string.rfind(s, sub [,start [,end]]) rfind(sub [,start [,end]]) Return the index postion ofthe last occurrence o sub in. the optional tir or fourth arguments are specified, only the corresponding slice of sis examined (but result is postion ins as a whole). Return ino occurrence is found. Position is zero-based, as with Python list indexing >>> import sting =) >>> singing mary had itl amb," 3, 1) 9 >>> stinging mary haa itl amb") 21 >>> singing mary had itl lamb," 3, 1) 4 For Python 1.6+, use of string object method is stylistically preferred in many cases: >>> mary had alle lamb'.tind(ac") 6 ‘SEE ALSO: string index() 141; string find) 195; string.rindex(s, sub [,start [,end]}) “rindex(sub [,start [Lendl]}) Return the same value as does sting. 
rfind() with the same arguments, except raise ValueError instead of returning -1 when sub does not occur in s.

>>> import string
>>> string.rindex("mary had a little lamb", "b")
21
>>> string.rindex("mary had a little lamb", "b", 3, 10)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "d:/py20sl/lib/string.py", line 148, in rindex
    return s.rindex(*args)
ValueError: substring not found in string.rindex

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

>>> "mary had a little lamb".rindex("ad")
6

SEE ALSO: string.rfind() 140; string.index() 135;

string.rjust(s=..., width=...)
"".rjust(width)

Return a string with s padded with leading spaces (but not truncated) to occupy length width (or more).

>>> import string
>>> string.rjust(width=30,s="Mary had a little lamb")
'        Mary had a little lamb'
>>> string.rjust("Mary had a little lamb", 5)
'Mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

>>> "Mary had a little lamb".rjust(25)
'   Mary had a little lamb'

SEE ALSO: string.ljust() 138; string.center() 133;

string.rstrip(s=...)
"".rstrip([chars=string.whitespace])

Return a string with trailing whitespace characters removed. For Python 1.6+, use of a string object method is stylistically preferred in many cases:

>>> import string
>>> s = """
...     Mary had a little lamb      \011"""
>>> string.rstrip(s)
'\012    Mary had a little lamb'
>>> s.rstrip()
'\012    Mary had a little lamb'

Python 2.3+ accepts the optional argument chars to the string object method. All trailing characters found in the string chars will be removed.

SEE ALSO: string.lstrip() 139; string.strip() 144;

string.split(s=... [,sep=... [,maxsplit=...]])
"".split([,sep [,maxsplit]])

Return a list of nonoverlapping substrings of s. If the second argument sep is specified, the substrings are divided around the occurrences of sep. If sep is not specified, the substrings are divided around any whitespace characters. The dividing strings do not appear in the resultant list.
If the third argument maxsplit is specified, everything "left over" after splitting maxsplit parts is appended to the list, giving the list length maxsplit+1.

>>> import string
>>> s = 'mary had a little lamb ...with a glass of sherry'
>>> string.split(s, ' a ')
['mary had', 'little lamb ...with', 'glass of sherry']
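
The effect of sep and maxsplit described above can be sketched with the string-object method form, which also runs on current Python versions. This is only an illustration: the variable names words, chunks, and first_rest are ours, not the book's, and the sample sentence is assumed to match the one in the session above.

```python
# Illustrative sketch: the variable names here are hypothetical.
s = 'mary had a little lamb ...with a glass of sherry'

# No sep given: divide around any run of whitespace characters.
words = s.split()

# Explicit sep: divide around each occurrence of ' a '.
# The separator itself does not appear in the result.
chunks = s.split(' a ')

# maxsplit=1: perform one split only; everything left over
# becomes the final element, so the list has maxsplit+1 items.
first_rest = s.split(' a ', 1)

print(words)
print(chunks)
print(first_rest)
```

With maxsplit=1 the second ' a ' is left untouched inside the last element, which is why the result has exactly two items.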
