Class SimpleTextExtractionStrategy

  • All Implemented Interfaces:
    RenderListener, TextExtractionStrategy

    public class SimpleTextExtractionStrategy
    extends Object
    implements TextExtractionStrategy
    A simple text extraction renderer. This renderer keeps track of the current y position of each string. If it detects that the y position has changed, it inserts a line break into the output. If the PDF renders text in a non-top-to-bottom order, the extracted text will follow the rendering order and will not be a true representation of how the text appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine whether a blank space should be inserted into the output.
    Since:
    2.1.5
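    The heuristic described above can be illustrated with a small, library-free model. This is a sketch only, not the iText source: the class LineBreakSketch, its Chunk type, the 1-point same-line tolerance, and the "gap wider than half a space" rule are all assumptions chosen for illustration.

```java
// Hypothetical model of the line-break and space heuristic; not the iText implementation.
class LineBreakSketch {

    /** One drawn string: its text, start x, end x, and baseline y (all in points). */
    static final class Chunk {
        final String text;
        final float startX, endX, y;

        Chunk(String text, float startX, float endX, float y) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.y = y;
        }
    }

    private final StringBuilder result = new StringBuilder();
    private final float spaceWidth; // assumed width of a single space, in points
    private Chunk last;             // previously rendered chunk, or null before the first one

    LineBreakSketch(float spaceWidth) {
        this.spaceWidth = spaceWidth;
    }

    /** Appends one chunk, inserting a line break or a space first when the heuristic says so. */
    void render(Chunk c) {
        if (last != null) {
            if (Math.abs(c.y - last.y) > 1f) {
                result.append('\n'); // y position changed: start a new line (assumed 1 pt tolerance)
            } else if (needsSpace(c)) {
                result.append(' ');
            }
        }
        result.append(c.text);
        last = c;
    }

    /** A space is added only if neither side already has one and the gap is wide enough. */
    private boolean needsSpace(Chunk c) {
        if (c.text.isEmpty() || result.length() == 0) {
            return false;
        }
        boolean previousEndsWithSpace = result.charAt(result.length() - 1) == ' ';
        boolean currentStartsWithSpace = c.text.charAt(0) == ' ';
        float gap = Math.abs(c.startX - last.endX);
        return !previousEndsWithSpace && !currentStartsWithSpace && gap > spaceWidth / 2f;
    }

    String text() {
        return result.toString();
    }

    /** Convenience helper: feeds the chunks in the given order and returns the text. */
    static String extract(float spaceWidth, Chunk... chunks) {
        LineBreakSketch sketch = new LineBreakSketch(spaceWidth);
        for (Chunk c : chunks) {
            sketch.render(c);
        }
        return sketch.text();
    }

    public static void main(String[] args) {
        System.out.println(extract(3f,
                new Chunk("Hello", 0f, 25f, 700f),
                new Chunk("world", 28f, 53f, 700f),
                new Chunk("Next", 0f, 20f, 688f)));
    }
}
```

    Because chunks are processed strictly in the order they are fed in, a page that draws its bottom line first would yield the bottom line first in this model, which is the limitation noted in the description.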
    • Constructor Detail

      • SimpleTextExtractionStrategy

        public SimpleTextExtractionStrategy()
        Creates a new text extraction renderer.
    • Method Detail

      • beginTextBlock

        public void beginTextBlock()
        Description copied from interface: RenderListener
        Called when a new text block begins (i.e. at the BT operator).
        Specified by:
        beginTextBlock in interface RenderListener
        Since:
        5.0.1
      • endTextBlock

        public void endTextBlock()
        Description copied from interface: RenderListener
        Called when a text block has ended (i.e. at the ET operator).
        Specified by:
        endTextBlock in interface RenderListener
        Since:
        5.0.1
      • appendTextChunk

        protected final void appendTextChunk(CharSequence text)
        Appends text to the text results. Subclasses can call this to insert text that would not normally be included in text parsing (e.g. the result of OCR performed against image content).
        Parameters:
        text - the text to append to the text results accumulated so far
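        The subclass hook described above can be sketched without the library. Everything here other than the appendTextChunk(CharSequence) signature is hypothetical: BaseExtractor stands in for the strategy class, onText, onImage, and text are simplified illustrative methods rather than iText signatures, and fakeOcr is a placeholder for a real OCR call.

```java
// Hypothetical stand-ins showing how a subclass might use a protected final append hook.
class AppendHookSketch {

    /** Stand-in for the strategy: all output flows through appendTextChunk. */
    static class BaseExtractor {
        private final StringBuilder result = new StringBuilder();

        /** Final, so subclasses can call it but cannot change how text is accumulated. */
        protected final void appendTextChunk(CharSequence text) {
            result.append(text);
        }

        /** Simplified text callback (illustrative signature). */
        public void onText(String text) {
            appendTextChunk(text);
        }

        /** Simplified image callback; the base class contributes no text for images. */
        public void onImage(byte[] imageBytes) {
            // intentionally empty
        }

        public String text() {
            return result.toString();
        }
    }

    /** Subclass that injects text for image content via the inherited hook. */
    static class OcrAwareExtractor extends BaseExtractor {
        @Override
        public void onImage(byte[] imageBytes) {
            appendTextChunk(fakeOcr(imageBytes));
        }

        /** Placeholder for a real OCR engine; returns a marker string only. */
        private static String fakeOcr(byte[] imageBytes) {
            return "[ocr:" + imageBytes.length + " bytes]";
        }
    }

    public static void main(String[] args) {
        OcrAwareExtractor extractor = new OcrAwareExtractor();
        extractor.onText("Invoice ");
        extractor.onImage(new byte[] {1, 2, 3});
        System.out.println(extractor.text());
    }
}
```

        Keeping the hook final means a subclass can add text at any point but cannot bypass or alter the way results are accumulated.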
      • renderText

        public void renderText(TextRenderInfo renderInfo)
        Captures text using a simplified algorithm for inserting hard returns and spaces.
        Specified by:
        renderText in interface RenderListener
        Parameters:
        renderInfo - the render information for the text to capture