Discussion:
[pywikibot] Replace.py: very slow reading from XML dump
Bináris
2018-09-16 20:03:42 UTC
Hi folks,

I still use trunk/compat for many reasons, but looking at the new code at
https://github.com/wikimedia/pywikibot/blob/master/scripts/replace.py, the
core version must suffer from the same problem.

If we use -namespace for namespace filtering, class
XmlDumpReplacePageGenerator will go through ALL pages, and THEN the result
is filtered by a namespace generator. This may MULTIPLY the running time in
some cases, which may cost hours or even days for a fix with complicated,
slow regexes.
I have just checked that the dump does contain namespace information. So why
don't we filter during the scan?
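
A minimal sketch of the idea, assuming pywikibot's xmlreader, where each
parsed dump entry carries the page namespace (the attribute name and type
may differ between versions, so treat this as illustrative rather than a
drop-in patch; the helper name is made up):

# Sketch only: skip pages by namespace while scanning the dump, before the
# expensive replacement check ever runs.
from pywikibot import xmlreader


def filtered_dump_pages(dump_filename, namespaces):
    """Yield dump entries whose namespace is in the given collection."""
    wanted = {int(ns) for ns in namespaces}
    for entry in xmlreader.XmlDump(dump_filename).parse():
        # If the entry's namespace is not wanted, the slow regexes are
        # never applied to its text.
        if int(entry.ns) not in wanted:
            continue
        yield entry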

I made an experiment. I modified my copy to display the count of articles
and the count of matching pages. The replacement was:
(ur'(\d)\s*%', ur'\1%'),
which seems pretty slow. :-(
The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole
dump, I used -xmlstart.) It went through 820 thousand pages and found 240+
matches (I displayed every 10th match).
Then the bot worked further 30-40 minutes to check the actual pages from
live wiki, this time with namespace filtering on. (I don't replace in this
phase, just save the list, so no human interaction is implied in this time.)
Guess the result! 62 out of 240 remained. This means that the bigger part
of these 14 hours went into /dev/null.
Now I realize how much time I wasted in the past 10 years. :-(
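
As an aside, a rough way to get a feel for the per-page cost of a pattern
like the one above (purely a measurement sketch; the sample text and the
call count are made up, not taken from the run described here):

import re
import timeit

pattern = re.compile(r'(\d)\s*%')
sample_text = 'A sample article mentions 100 % and 42 % somewhere.'

# Seconds for 1000 substitutions on one short sample text.
print(timeit.timeit(lambda: pattern.sub(r'\1%', sample_text), number=1000))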

I am sure that passing namespaces to XmlDumpReplacePageGenerator is worth it.
--
Bináris
i***@gno.de
2018-09-17 06:10:00 UTC
There is another issue with that generator: it always checks for replacements but does not apply them, which means the replacements are always done twice; this might slow down the run too. I think we should open a Phabricator task for it.
Best
Xqt
Bináris
2018-09-17 07:03:08 UTC
I have done the work for compat, now it is running, and I plan to open the
ticket when I get the numbers.
As far as I know, compat is unfortunately totally deprecated. Is there,
despite this, any possibility to upload a patch? Otherwise I can describe
here what I did. I know that people still use compat.

Let's talk about the second problem. I am not sure it can easily be solved
to everybody's satisfaction, but I was already thinking about it. (I have
plenty of plans concerning replace.py, which is quite poor now.)

Advanced use of replace.py needs a two-run approach. First we collect
candidates from the dump with no human interaction, while the bot owner
sleeps or works or does any useful thing.
The titles are collected into a file, and in the second run the owner
processes them interactively, much faster. All the pieces of this process
that I implemented in compat are completely missing from core now, making
replace.py useless for me, but this is obviously a temporary state.
So let's keep in mind that replace.py must support direct, immediate
replacements as well as two-run replacements.
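
A rough sketch of what the first, offline run could look like, assuming
pywikibot's xmlreader; the file name, dump name and pattern are only
placeholders:

import re
from pywikibot import xmlreader

PATTERN = re.compile(r'(\d)\s*%')  # placeholder fix pattern

# Run 1: scan the dump with no human interaction and save candidate titles.
with open('candidates.txt', 'w', encoding='utf-8') as out:
    for entry in xmlreader.XmlDump('huwiki-latest-pages-articles.xml.bz2').parse():
        if PATTERN.search(entry.text):
            out.write('[[%s]]\n' % entry.title)

# Run 2 would then feed candidates.txt to the interactive bot (for example
# through a text-file page generator), so only real candidates are fetched
# from the live wiki.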

If you want to replace immediately in one run, the replacement should be
done only once to save time. But where? XMLdumpgenerator is a separate
class that can yield pages; I don't think we should remove it. The main
class has the complexity of how to handle the separate cases and human
interactions. I don't think we should transfer it to XMLdumpgenerator.
Perhaps the best solution is if XMLdumpgenerator does not try to replace,
just to search. This will be somewhat faster.

If you save the titles for later human processing, XMLdumpgenerator does
not have to do the replacement in most cases, just search again.
There is a third case: when I develop new fixes, I often do experiments. It
is useful to see the planned replacements during the first run; this helps
me enhance the fix. So I wouldn't completely remove the replacing ability.
This needs a separate switch which can be ON by default when we use -xml.

Please keep in mind that accelerating the generator is important, but
keeping the speed of the main replace bot high is even more important. If
you want to totally avoid double work, you don't use a dump at all.

So I have three ideas for how this could work:

1. The switch tells the generator to replace the second parameter of the
replacement tuples with ''. I don't have numbers on how much faster this
would be. This has some danger, so the bot must ensure that the switch is
effective only if we save the titles to a file or work in simulation mode,
so as not to damage the wiki.
2. The generator will search instead of replace. I don't like this idea,
because textlib.py has the complexity to handle exceptions and comments
and nowikis etc.
3. We enhance textlib.py so that replaceExcept() will have a new
parameter. This will make replaceExcept() use a search rather than a
replace. *This is the best solution.* In this case the function could
return a dummy text which differs from the original, so that we don't have
to rewrite the scripts which use it (see the sketch after this list).
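
A rough standalone illustration of how such a search-only mode could
behave; this is not the real textlib.replaceExcept() (whose exception
handling is far more involved), and all names here are made up for the
sketch:

import re

def replace_except_search_only(text, old, exception_regexes):
    """Return a dummy changed text if `old` matches outside exception spans."""
    protected = []
    for exc in exception_regexes:
        protected.extend(m.span() for m in exc.finditer(text))
    for match in old.finditer(text):
        if not any(start <= match.start() < end for start, end in protected):
            # Any difference from the original is enough for callers that
            # only test whether the returned text differs from the input.
            return text + '\x00'
    return text  # unchanged: no match, the page would not be yielded

The dummy-text trick is what keeps existing callers working, because they
only compare the returned text with the original.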

Anyhow, replaceExcept() needs another enhancement which I have already done
in my copy. It should optionally return the (old, new) pairs for further
processing; this is very useful for developing fixes, measuring efficiency,
creating statistics etc. This will be a separate task, but if you agree with
this solution, we may add the two new parameters in one go.
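
A simplified stand-in (not a patch to textlib) showing how the replacement
call could collect the (old, new) snippets as it goes; the helper name is
invented for the sketch:

import re

def replace_and_collect(text, old, new):
    """Apply one replacement and also return the list of (old, new) pairs."""
    pairs = []

    def _repl(match):
        replacement = match.expand(new)
        pairs.append((match.group(0), replacement))
        return replacement

    return old.sub(_repl, text), pairs

# Example with the fix from the first post:
new_text, changes = replace_and_collect('42 %', re.compile(r'(\d)\s*%'), r'\1%')
print(new_text, changes)  # 42% [('2 %', '2%')]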

So we have three tickets now. :-)
--
Bináris
i***@gno.de
2018-09-17 08:22:30 UTC
Post by Bináris
I have done the work for compat, now it is running, and I plan to open the ticket when I get the numbers.
As far as I know, compat is unfortunately totally deprecated. Is there, despite this, any possibility to upload a patch?
Not for compat branch but for the core repository. You may use the Gerrit Patch Uploader [1] for it.

If you are unable to merge your patch into core, send me your compat patch and I'll try to merge it.

[1] https://www.mediawiki.org/wiki/Gerrit_patch_uploader


Best

Xqt
Bináris
2018-09-17 10:42:53 UTC
We enhance textlib.py so that replaceExcept() will have a new parameter.
This will make replaceExcept() use a search rather than a replace. *This
is the best solution.* In this case the function could return a dummy
text which differs from the original, so that we don't have to rewrite the
scripts which use it.
Also, if we have multiple (old, new) pairs in a fix, with this switch
replaceExcept() can return at the first match, so the page will be
listed. This will speed it up further (see the sketch below).
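
A tiny illustration of the first-match shortcut across the pairs of a fix
(the second pattern is only a made-up example):

import re

def page_matches_any(text, compiled_patterns):
    """True as soon as any pattern of the fix matches; later ones are skipped."""
    return any(pattern.search(text) for pattern in compiled_patterns)

fix_patterns = [re.compile(r'(\d)\s*%'), re.compile(r'(\d)\s*permille')]
print(page_matches_any('It happens in 50 % of pages.', fix_patterns))  # True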
Bináris
2018-09-17 17:57:02 UTC
Post by Bináris
The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole
dump, I used -xmlstart.) It went through 820 thousand pages and found 240+
matches (I displayed every 10th match).
I was not quite right. With the modified code it took 12 hours instead of
14, 630,000 pages were scanned instead of 820,000, and 83 matches were
found instead of 240+ (of which 62 are real). But this is still not the same.