PHP: Parsing huge XML without running out of memory

Question

Does anyone know whether it is possible to parse a 500 MB XML file? I have a huge XML file with only a few nodes (perhaps not that few, but they are small compared to one of them). One of the nodes is an attachment node whose content is base64-encoded, and that node can be around 500 MB.

Is there any way to decode it and write it to a file without running out of memory on the server? Raising the memory limit above 1 GB is not a solution here.

I looked at XMLReader, but as far as I can see, I can reach the node I want to parse, and then I have to load the whole node into memory, which is a bad idea. I need to append the node's content to a file without reading it all at once, and I need to decode it first.

To make it even more fun, I will probably receive the XML in parts, but I am planning to write the parts to a file one after another (appending each to the end of the file).

Best Answer

Instead of XMLReader, use XML Parser (the xml_* functions). It lets you parse XML in chunks, so it is very memory-efficient. Here is a working example that looks for <ATTACHMENT> tags and decodes their contents into files. Dealing with base64 is easy: just remember that it turns every 3 bytes into a 4-character encoded string, so as long as you supply chunks whose length is divisible by 4, you can concatenate the decoded results.

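<?php

class ExtractAttachments {

    private $parser;
    private $tmpFile;
    private $tmpHandle;
    private $buffer = '';   // base64 characters carried over between chunks

    private $files = array();

    public function __construct($xml) {
        $this->parser = xml_parser_create('UTF-8');
        // Case folding is on by default, so tag names arrive upper-cased.
        xml_set_element_handler($this->parser, array($this, 'tag_start'), array($this, 'tag_end'));
        xml_set_character_data_handler($this->parser, array($this, 'cdata'));
        $handle = fopen($xml, 'rb');
        if ($handle === false) {
            throw new RuntimeException("Cannot open $xml");
        }
        // Feed the file to the parser 4 KB at a time.
        while (!feof($handle)) {
            $string = fread($handle, 4096);
            if ($string === false) {
                throw new RuntimeException("Cannot read $xml");
            }
            $this->parse($string, false);
        }
        $this->parse('', true);
        fclose($handle);
        xml_parser_free($this->parser); // needed on PHP 7, has no effect on PHP 8
    }

    private function parse($string, $isFinal) {
        if (!xml_parse($this->parser, $string, $isFinal)) {
            $error = xml_error_string(xml_get_error_code($this->parser));
            throw new RuntimeException("XML error: $error");
        }
    }

    public function tag_start($parser, $tag, $attr) {
        if ($tag == 'ATTACHMENT') {
            // Each attachment is decoded into its own temporary file.
            $this->tmpFile = tempnam(__DIR__, 'xml');
            $this->tmpHandle = fopen($this->tmpFile, 'wb');
        }
    }

    public function tag_end($parser, $tag) {
        if ($tag == 'ATTACHMENT' && $this->tmpHandle) {
            // Flush whatever is left in the buffer (the last, possibly padded group).
            if ($this->buffer !== '') {
                fwrite($this->tmpHandle, base64_decode($this->buffer));
                $this->buffer = '';
            }
            fclose($this->tmpHandle);
            $this->tmpHandle = null;
            $this->files[] = $this->tmpFile;
        }
    }

    public function cdata($parser, $data) {
        if ($this->tmpHandle) {
            // Drop all whitespace (including line breaks inside the base64 text)
            // so that the 4-character grouping below stays aligned.
            $data = $this->buffer . preg_replace('/\s+/', '', $data);
            $this->buffer = '';
            // Decode only complete 4-character groups; keep the rest for later.
            $modulo = strlen($data) % 4;
            if ($modulo != 0) {
                $this->buffer = substr($data, -$modulo);
                $data = substr($data, 0, -$modulo);
            }
            fwrite($this->tmpHandle, base64_decode($data));
        }
    }

    public function getFiles() {
        return $this->files;
    }
}

$xml = new ExtractAttachments('large.xml');
print_r($xml->getFiles()); // paths of the decoded attachment files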
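
The base64 property the answer relies on can be checked in isolation. The following is only a minimal sketch: `decodeInChunks()` is a hypothetical helper written for this illustration (it is not part of the class above), and the chunk size of 7 is chosen deliberately so that chunks do not line up with 4-character groups. It assumes the input contains no whitespace and that padding, if any, appears only at the very end.

```php
<?php
// Hypothetical helper for illustration: decode a base64 string piece by piece,
// carrying any incomplete 4-character group over to the next piece.
function decodeInChunks(string $encoded, int $chunkSize): string
{
    $decoded = '';
    $carry = '';
    foreach (str_split($encoded, $chunkSize) as $chunk) {
        $data = $carry . $chunk;
        // Largest prefix whose length is a multiple of 4.
        $usable = strlen($data) - (strlen($data) % 4);
        $carry = substr($data, $usable);
        $decoded .= base64_decode(substr($data, 0, $usable));
    }
    // Whatever remains is the final, possibly incomplete, group.
    return $decoded . base64_decode($carry);
}

$original = str_repeat("attachment bytes \x00\xff", 50);
$encoded = base64_encode($original);

// Decoding in 7-character pieces gives the same bytes as decoding at once.
var_dump(decodeInChunks($encoded, 7) === $original); // bool(true)
```

Only the carried-over remainder and the current piece are held in memory at any time, which is the same idea the `cdata` handler uses with its buffer.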

Original question on Stack Overflow: http://stackoverflow.com/questions/22080947