Python weave

As part of my master thesis I have developed a data mining algorithm and implemented it in Python. Benchmarks revealed that most of the time is spend iterating over big arrays and calculating the statistical relevance of sub arrays. After some time of tweaking I came to think that maybe involving some C/C++ code might speed things up a bit…

I stumbled across weave, which lets you include inline C/C++ code in your Python script. The usage is quite simple. You basically provide the C/C++ code as a string and execute it. The first time this is done takes some time, since the compiler gets invoked, but afterwards your script simply calls the compiled code fragments. I have to admit, that this is not a fair and fully blown benchmark, but the first tests blew my mind =)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
from numpy import array
from time import time
import scipy.weave.inline_tools as inline_tools

def weaveInline(arr):
    code = """
    long int res = 0;
    for (int i = 0; i < Narr[0]; i++) {
        res += arr[i];
    }
    return_val = res;
    """

    return inline_tools.inline(code,['arr'])

def nativePython(arr):
    res = 0
    for i in xrange(len(arr)):
        res += arr[i]
    return res

if __name__ == '__main__':
    checksum1 = 0
    checksum2 = 0

    f = open('benchmark.dat', 'w')
    f.write("#calls, arraysize, zero, native on python array, inline on numpy array\n")
    for calls in xrange(10,110,10):
        for arraysize in xrange(0,55000,5000):

            arr1 = range(1, arraysize)
            arr2 = array(range(1, arraysize))

            tmp = time()
            for i in range(calls):
                checksum1 += nativePython(arr1)
            t1 = time() - tmp

            tmp = time()
            for i in range(calls):
                checksum2 += weaveInline(arr2)
            t2 =  time() - tmp

            f.write("%d %d %f %f %f\n" %(calls, arraysize, 0.0, t1, t2))
        f.write("\n")
    f.close()
    print checksum1
    print checksum2

Basically I was interested in how much runtime is spend on the call of the function and how it evolves with the size of the array.

pyvsweave onlyweave

The first plot shows a comparison between Pythons native arrays and the C++ implementation. The second plot just illustrates that the C++ way doesn’t run in const time, but shows the same growth behaviour (only flatter). One thing about the implementation:

I did run the pure python version on numpy arrays too, but normal python lists seemed to perform best here.

After all keep in mind, that this is not a serious benchmark. I simply wanted to share this info =)

<3  gecko

Update

Here are the results of what Jörn ased for in one of the comments:

all joernvsc

The first graph shows that Jörn’s code performs similar to the weave code, although the weave version is a tiny bit faster, as shown in the second graph. I also tried arrays, that were bigger than my CPU’s L1 cache, but no significant difference here.
An interesting thing though is that depending on the size of the array, sometimes a periodic up and down in the runtime of the native python version occurs. This might be caching related…

Istanbul

Hier ein paar Fotos, die ich dort gemacht habe:

dsc03596 dsc03550

dsc03698 dsc03482

dsc03507 hagiasophia dsc03519

Und hier noch die Dreingabe… Alles scharf, außer mir! Priceless :)

dsc03512

<3 gecko

Python: Finding shared items

Today I investigated in how to find shared items in listes. My first approach was straight forward and I checked for each item in list1, if it occurs in list2. Over time this became a bit more elegant and turned into:
shared_items = dict( (s,0) for s in list1 if s in list2 )
But this takes quite a while and so I wondered, if there were better solutions. First thing I found, was that the numpy arrays seem to be much faster to iterate, but the second thing is much more performance boosting… It makes a huge difference for the runtime, if you use
shared_items = dict((s,0) for s in set(list1) and set(list2))
instead of
shared_items = dict( (s,0) for s in list1 if s in list2 )
EDIT:
Jörn posted in a comment, to just skip the bulls##t an go straight for:
shared_items = set(list1) & set(list2)
Have a look:
runtime comparison runtime runtime Joern vs Gecko
And here is the code, to give you an idea, of what I did…enjoy :)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#!/usr/bin/env python
# Task: Find all shared elements in two arrays
from numpy.random import randint
from time import time

print "Comparing two ways of finding shared elements"
# Two arrays with size 10000, filled with
# random values between 1 and  1000000
list1 = randint(1,1000000,10000)
list2 = randint(1,1000000,10000)

# Slow
t1_start = time()
shared1 = dict((s,0) for s in list1 if s in list2)
t1_stop = time()


# Fast
t2_start = time()
shared2 = dict((s,0) for s in set(list1) and set(list2))
t2_stop = time()

# Joern fast:
# Edited, after a comment from Joern...
t3_start = time()
shared3 = set(list1) & set(list2)
t3_stop = time()


# Results
print "Slow algorithm:  %.5f seconds." %(t1_stop - t1_start)
print "Fast algorithm:  %.5f seconds." %(t2_stop - t2_start)
print "Joern algorithm: %.5f seconds." %(t3_stop - t3_start)
print "Same result:    ", shared1 == shared2 == shared3

Although not shown here both methods do not only work on numbers, but on strings, or anything else that is sortable too…

Letztens im Milchschaum:

Kaffeesatz lesen ist ja ein alter Hut, aber manchmal offenbart sich die Zukunft auch im Milchschaum. Ich sehe, ich sehe:

Ich sehe ein Eichhörnchen! Die Zukunft meint es gut mit Ihnen und hält ein Eichhörnchen bereit.

Say cheeeeeeese!

This is the tip of a ballpoint pen…
I love my Macro lens :)
Self portrait on a ballpoint pen

Ich habe Land geschaffen…

Hier mein neustes kleines Projekt:
Eine kleine Agenten-Simulation. Das ganze besteht aus einer Server/Client Architektur. Eine Welt in die sich per Socket Agenten einloggen können. Natürlich gibt es Server und Client in Python, aaaaber… man kann die Agenten auch per netcat kreieren und steuern, wie man im Bild erkennen kann :)
Man kann mittels eines Bildes die Wände der Welt definieren, die von den Agenten erkannt werden. Die Jungs kollidieren auch schon mit den Wänden, aber noch nicht unter einander.
Simulation
Falls jemand Teile davon gebrauchen kann, oder den Code weiterentwickeln will: Fühlt euch frei den Code nach gut Dünken zu nutzen (quasi MIT-Lizenz). Hier gehts zum Download.
Der Code ist dokumentiert und eine README gibt’s auch, die die API und die Architektur erklärt. Viel Spass :D

Spaß mit Bildern

Bin letztens auf ein interessantes Paper gestoßen:
Das Thema ist, Copy & Paste in Bildern zu erkennen. Also im Prinzip den Einsatz von Photoshops Clone-tool.
Die Grundgedanke ist ganz einfach:
Zuerst werden aus dem Bild alle möglichen 16×16 Pixel großen Unterbilder extrahiert. Jedes Unterbild wird dann diskret Cosinus Transformiert (man schaut quasi welche Cosinus Anteile wie stark in den Unterbildern vorhanden sind) Zum Schluss vergleicht man die Ergebnisse der Transformationen untereinander. Cluster von ähnlichen Transformationen deuten stark auf Copy & Paste hin.

Hier hat sich jemand daran versucht das ganze nachzubauen. Ich hab’s auch probiert… überaschender Weise in Python ;)
Allerdings habe ich mich aus Gründen der Performanz gegen eine selbst geschriebene DCT und für numPy’s FFT entschieden. Es funktioniert, aber das Ergebnis überzeugt mich noch nicht so recht… Dafür habe ich mich in diesem Rahmen mal mit der Image-Library auseinander gesetzt und jetzt kann ich Bilder invertieren :D

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
from PIL import Image,ImageOps
from sys import argv

# open the image
filename = argv[1]
im = Image.open(filename)
w,h = im.size

# read the pixel data in a list
data = list(im.getdata())

# invert each pixel
counter = 0
for y in range(0,h):
    for x in range(0,w):
        RGBs = data[counter]
        counter += 1
        im.putpixel((x,y), (255-RGBs[0], 255-RGBs[1], 255-RGBs[2]))

# come up with a good name
tmp = filename.split('.')
tmp.pop() # kick file-extension
outputName = ''
for part in tmp: # In case there is a '.' in the name...
    outputName += part
    outputName += '_inverted.png'

    # save the inverted image
    im.save(outputName)

Hier ein Beispiel:

original

original_inverted

Einfach den Code unter Invert.py speichern und im Terminal folgendes tippen:

1
python Invert.py original.jpg

Der Barcode hatte Geburtstag

Letztens hatte der Barcode Geburtstag und die Jungs von Hackaday.com hatten eine kleine Challange aufgesetzt. Es ging darum einen Barcode zu entziffern…
Das hat mein Interesse geweckt! Hier ist meine Lösung dazu und noch zwei Barcodes zum rumspielen… Enjoy =)

wikipedia Barcode

hackaday Barcode

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
#!/usr/bin/python

import Image
from sys import argv

# lookup tables for the 3 different charsets
CharSetA =  {
    ' ':0, '!':1, '"':2, '#':3, '$':4, '%':5, '&':6, "'":7,
    '(':8, ')':9, '*':10, '+':11, ',':12, '-':13, '.':14, '/':15,
    '0':16, '1':17, '2':18, '3':19, '4':20, '5':21, '6':22, '7':23,
    '8':24, '9':25, ':':26, ';':27, '<':28, '=':29, '>':30, '?':31,
    '@':32, 'A':33, 'B':34, 'C':35, 'D':36, 'E':37, 'F':38, 'G':39,
    'H':40, 'I':41, 'J':42, 'K':43, 'L':44, 'M':45, 'N':46, 'O':47,
    'P':48, 'Q':49, 'R':50, 'S':51, 'T':52, 'U':53, 'V':54, 'W':55,
    'X':56, 'Y':57, 'Z':58, '[':59, '\':60, ']':61, '^':62, '_':63,
    '
\x00':64, '\x01':65, '\x02':66, '\x03':67, '\x04':68, '\x05':69, '\x06':70, '\x07':71,
    '
\x08':72, '\x09':73, '\x0A':74, '\x0B':75, '\x0C':76, '\x0D':77, '\x0E':78, '\x0F':79,
    '
\x10':80, '\x11':81, '\x12':82, '\x13':83, '\x14':84, '\x15':85, '\x16':86, '\x17':87,
    '
\x18':88, '\x19':89, '\x1A':90, '\x1B':91, '\x1C':92, '\x1D':93, '\x1E':94, '\x1F':95,
    '
FNC3':96, 'FNC2':97, 'SHIFT':98, 'Code C':99, 'Code B':100, 'FNC4':101, 'FNC1':102, 'START A':103,
    '
START B':104, 'START C':105, ' STOP':106
}

CharSetB = {
    '
':0, '!':1, '"':2, '#':3, '$':4, '%':5, '&':6, "'":7,
    '
(':8, ')':9, '*':10, '+':11, ',':12, '-':13, '.':14, '/':15,
    '
0':16, '1':17, '2':18, '3':19, '4':20, '5':21, '6':22, '7':23,
    '
8':24, '9':25, ':':26, ';':27, '<':28, '=':29, '>':30, '?':31,
    '
@':32, 'A':33, 'B':34, 'C':35, 'D':36, 'E':37, 'F':38, 'G':39,
    '
H':40, 'I':41, 'J':42, 'K':43, 'L':44, 'M':45, 'N':46, 'O':47,
    '
P':48, 'Q':49, 'R':50, 'S':51, 'T':52, 'U':53, 'V':54, 'W':55,
    '
X':56, 'Y':57, 'Z':58, '[':59, '\':60, ']':61, '^':62, '_':63,
    '
' :64, 'a':65, 'b':66, 'c':67, 'd':68, 'e':69, 'f':70, 'g':71,
    '
h':72, 'i':73, 'j':74, 'k':75, 'l':76, 'm':77, 'n':78, 'o':79,
    '
p':80, 'q':81, 'r':82, 's':83, 't':84, 'u':85, 'v':86, 'w':87,
    '
x':88, 'y':89, 'z':90, '{':91, '|':92, '}':93, '~':94, '\x7F':95,
    '
FNC3':96, 'FNC2':97, 'SHIFT':98, 'Code C':99, 'FNC4':100, 'Code A':101, 'FNC1':102, 'START A':103,
    '
START B':104, 'START C':105, ' STOP':106
}

CharSetC = {
    '
00':0, '01':1, '02':2, '03':3, '04':4, '05':5, '06':6, '07':7,
    '
08':8, '09':9, '10':10, '11':11, '12':12, '13':13, '14':14, '15':15,
    '
16':16, '17':17, '18':18, '19':19, '20':20, '21':21, '22':22, '23':23,
    '
24':24, '25':25, '26':26, '27':27, '28':28, '29':29, '30':30, '31':31,
    '
32':32, '33':33, '34':34, '35':35, '36':36, '37':37, '38':38, '39':39,
    '
40':40, '41':41, '42':42, '43':43, '44':44, '45':45, '46':46, '47':47,
    '
48':48, '49':49, '50':50, '51':51, '52':52, '53':53, '54':54, '55':55,
    '
56':56, '57':57, '58':58, '59':59, '60':60, '61':61, '62':62, '63':63,
    '
64':64, '65':65, '66':66, '67':67, '68':68, '69':69, '70':70, '71':71,
    '
72':72, '73':73, '74':74, '75':75, '76':76, '77':77, '78':78, '79':79,
    '
80':80, '81':81, '82':82, '83':83, '84':84, '85':85, '86':86, '87':87,
    '
88':88, '89':89, '90':90, '91':91, '92':92, '93':93, '94':94, '95':95,
    '
96':96, '97':97, '98':98, '99':99, 'Code B':100, 'Code A':101, 'FNC1':102, 'START A':103,
    '
START B':104, 'START C':105, ' STOP':106
}

# The barcode and the according number in the according char-set
ValueEncodings = {
    0:'
11011001100',  1:'11001101100',  2:'11001100110',
    3:'
10010011000',  4:'10010001100',  5:'10001001100',
    6:'
10011001000',  7:'10011000100',  8:'10001100100',
    9:'
11001001000', 10:'11001000100', 11:'11000100100',
    12:'
10110011100', 13:'10011011100', 14:'10011001110',
    15:'
10111001100', 16:'10011101100', 17:'10011100110',
    18:'
11001110010', 19:'11001011100', 20:'11001001110',
    21:'
11011100100', 22:'11001110100', 23:'11101101110',
    24:'
11101001100', 25:'11100101100', 26:'11100100110',
    27:'
11101100100', 28:'11100110100', 29:'11100110010',
    30:'
11011011000', 31:'11011000110', 32:'11000110110',
    33:'
10100011000', 34:'10001011000', 35:'10001000110',
    36:'
10110001000', 37:'10001101000', 38:'10001100010',
    39:'
11010001000', 40:'11000101000', 41:'11000100010',
    42:'
10110111000', 43:'10110001110', 44:'10001101110',
    45:'
10111011000', 46:'10111000110', 47:'10001110110',
    48:'
11101110110', 49:'11010001110', 50:'11000101110',
    51:'
11011101000', 52:'11011100010', 53:'11011101110',
    54:'
11101011000', 55:'11101000110', 56:'11100010110',
    57:'
11101101000', 58:'11101100010', 59:'11100011010',
    60:'
11101111010', 61:'11001000010', 62:'11110001010',
    63:'
10100110000', 64:'10100001100', 65:'10010110000',
    66:'
10010000110', 67:'10000101100', 68:'10000100110',
    69:'
10110010000', 70:'10110000100', 71:'10011010000',
    72:'
10011000010', 73:'10000110100', 74:'10000110010',
    75:'
11000010010', 76:'11001010000', 77:'11110111010',
    78:'
11000010100', 79:'10001111010', 80:'10100111100',
    81:'
10010111100', 82:'10010011110', 83:'10111100100',
    84:'
10011110100', 85:'10011110010', 86:'11110100100',
    87:'
11110010100', 88:'11110010010', 89:'11011011110',
    90:'
11011110110', 91:'11110110110', 92:'10101111000',
    93:'
10100011110', 94:'10001011110', 95:'10111101000',
    96:'
10111100010', 97:'11110101000', 98:'11110100010',
    99:'
10111011110',100:'10111101110',101:'11101011110',
    102:'
11110101110',103:'11010000100',104:'11010010000',
    105:'
11010011100',106:'11000111010'
}# '


# swap the dictionaries
def swap_dictionary(original_dict):
    # old version
#   return dict( [(v, k) for (k, v) in original_dict.iteritems()] )
    # refined version... with generator expressions...thx J
    return dict( (v, k) for (k, v) in original_dict.iteritems() )

# I found these encoding tables on the web...
# sadly the key:value pairs are the wrong way for my need
# so i swap them =)
CharSetA = swap_dictionary(CharSetA)
CharSetB = swap_dictionary(CharSetB)
CharSetC = swap_dictionary(CharSetC)
ValueEncodings = swap_dictionary(ValueEncodings)


# read in image
im_file = Image.open(argv[1])
im = im_file.getdata()
width = im.size[0]
height = im.size[1] / 2

# convert to 1 pixel high stripe in the middle of the image
cropped = im.crop((0,height,width,height+1))

# translate to 0's and 1's
code = ''
for pixel in cropped:
    try: # if is RGB picture
        pix = pixel[0]
    except: # if is greyscale picture
        pix = pixel
    if pix >200:
        pix = 0
    else:
        pix = 1
    code += str(pix)


# remove leading and trailing 0's
code = code.lstrip('0').rstrip('0')
   
   
# pop first 11 bits from the code
# these 11 bits stand for a char
def get_char(code):
    char = code[:11]
    code = code[11:]
    return char, code


out_buff = ""
# determine the encoding
first, code = get_char(code)
encoding = ValueEncodings[first]

if encoding == 103:
    charset = CharSetA
    out_buff += 'START A '
elif encoding == 104:
    charset = CharSetB
    out_buff += 'START B '
elif encoding == 105:
    charset = CharSetC
    out_buff += 'START C '
else:
    print "Encoding unknown"
    exit(0)


# decypher the message
while(len(code) >= 11):
    next_char, code = get_char(code)
    encoding_number = ValueEncodings[next_char]
    out_buff += charset[encoding_number]

# display the message
print out_buff

Einfach als barcode.py speichern und von der Konsole aus mit:

1
python barcode.py BarcodeBild.jpg

starten. Das Ganze funktiniert allerdings nur unter bestimmten Voraussetzungen. Der Barcode muss vom Typ 128 sein und das Bild des Barcodes muss eine 1 mit einem ein Pixel breiten Strich kodieren.

Nachtbild

Hier mal ein Bild mit Langzeitbelichtung vom Stahldrachen an der Feinkost in Leipzig.
dsc03062-klein

1st school day

My little nieces had their first day in school and I took a some pictures.

bild-17

bild-21