ARTE Chatbot v0.1 - Initial Release

The initial release of ARTE, our Advanced Real Estate Tax Expert, offers a free and instant way to get answers to your 1031 exchange questions with incredible accuracy.


Deferred.com’s AI chatbot, the Advanced Real Estate Tax Expert (ARTE), is a state-of-the-art chatbot driven by a large language model and trained to be an expert on 1031 exchanges and related real estate tax matters.

To date, ARTE has passed the following professional certification and continuing education courses:

ARTE has also passed a comprehensive set of internal benchmarks gauging accuracy across a number of topical areas related to 1031 exchanges and across varying question types. Below, we dive into the details of our process, publish our benchmark scores, and compare our results against the most widespread, free, publicly available consumer model.

Performance Benchmarks

Topical Expertise

To assess ARTE’s performance over time, we’ve developed a set of internal tests we can use for benchmarking. The benchmarks are categorized by topical areas judged to be most important when answering deep technical tax questions as they relate to real estate investors performing 1031 exchanges.

For each topical area, a score of 70% is considered passing for a licensed professional. We hold ARTE to a high standard, requiring a score that is greater than what would be expected of a typical professional working in the field.

| Topical area | Key topics | Benchmark Score |
| --- | --- | --- |
| Tax code, Regulations, and Other Clarifications | Basic terminology and structure related to IRC Section 1031; Federal Tax Code and Regulations; IRS Safe Harbor Rules, Clarifications, and Other Rulings; Calculating cost basis, boot, and gain; State and federal tax terms; Case law; Tax Filing Requirements | 87.67% |
| 1031 Exchange Process | Requirements for an exchange; Timing requirements; Identification requirements; Flow of funds and Constructive Receipt | 91.67% |
| Ownership considerations | Individual, joint, and spousal ownership; Partnership, corporation, and limited liability company concerns; Disregarded entities; Fractional ownership interests (TICs/DSTs); Related party transactions; Common vesting issues; Dealer and developer status | 100.00% |
| Qualifying properties | Qualifying and non-qualifying property; Like-kind property types; Property usage and conversion of use; Mixed-use property; Primary residence considerations | 81.82% |
| Types of Sales & Exchanges | Forward Exchanges; Reverse Exchanges; Simultaneous Exchanges; Construction or Improvement exchange issues; Multiple Property Exchanges; Combination exchanges; Installment sales (Section 453); Involuntary conversions (Section 1033); Mortgage and Financing Considerations | 90.63% |
| History and Evolution of 1031 exchanges | History and Evolution of 1031 exchanges | 64.00% |

Performance by question type

ARTE’s performance is currently benchmarked against two types of questions: Multiple Choice and Open Response.

| Question Type | ARTE Benchmark Score |
| --- | --- |
| Multiple Choice | 84.52% |
| Open Response | 83.33% |

Comparison with Public Models

Topical Expertise

To understand our performance for our initial release, we’ve run a comparison against OpenAI’s GPT-3.5 model, which at the time of this release is the broadly available, free consumer model behind ChatGPT.

ARTE significantly outperforms the public model on our internal benchmarks. With a score of 70% required to pass, the public model fails our internal test and is much more likely to provide inaccurate or misleading tax advice when it comes to 1031 exchanges.

| Topical area | ARTE Benchmark Score | OpenAI GPT-3.5 | Delta |
| --- | --- | --- | --- |
| Tax code, Regulations, and Other Clarifications | 87.67% | 68.49% | +128% |
| 1031 Exchange Process | 91.67% | 58.33% | +157% |
| Ownership considerations | 100.00% | 75.00% | +133% |
| Qualifying properties | 81.82% | 54.55% | +150% |
| Types of Sales & Exchanges | 90.63% | 43.75% | +207% |
| History and Evolution of 1031 exchanges | 64.00% | 56.00% | +114% |
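The Delta column appears to express ARTE’s score as a percentage of GPT-3.5’s score (a ratio) rather than a percentage-point gain; that reading is our assumption, since the exact formula isn’t published. A minimal sketch of the arithmetic:

```python
def relative_score(arte: float, baseline: float) -> str:
    """Express one score as a rounded percentage of a baseline score.

    Assumed interpretation of the Delta column; not a published formula.
    """
    return f"+{round(arte / baseline * 100)}%"

# First row of the table: 87.67 / 68.49 is roughly 1.28, reported as +128%.
print(relative_score(87.67, 68.49))  # +128%
```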

Performance by question type

ARTE also significantly outperforms GPT-3.5 when results are broken down by question type.

Unsurprisingly, the public model is more likely to perform well in a multiple choice evaluation, given the bounded set of potential answers (though in this case it still failed to achieve a passing grade).

With open response questions, however, the model’s increased likelihood of hallucinating absent specific subject-matter expertise leads to a much lower benchmark score and highlights ARTE’s outperformance. When a wrong answer to a complex question can lead to thousands, or hundreds of thousands, of dollars in tax liability, this difference is significant.

| Question Type | ARTE Benchmark Score | GPT-3.5 Benchmark Score | Delta |
| --- | --- | --- | --- |
| Multiple Choice | 84.52% | 59.52% | 142% |
| Open Response | 83.33% | 50.00% | 167% |

Methodology

Training data

ARTE is trained on public documents deemed relevant to 1031 exchanges. This includes the Internal Revenue Code, regulations, IRS rulings, case law, and other material used to provide guidance or set precedent when evaluating whether a 1031 exchange qualifies and a taxpayer can successfully defer their capital gains.

Courses

When determining if ARTE can pass a professional certification course, the course material is run through an evaluator using the same model configuration as our chatbot. Course materials have not been included in the training data to prevent bias based on recalling specific course content to answer the question. Once an evaluation is run, a human in the loop translates the evaluation output into the testing platform and submits the test to determine if a passing score was achieved.

Benchmark & Evaluation Details

Benchmarks

Topical areas within the benchmarks are not weighted in any way, which may skew results based on the number of questions in each topical area.

There are a number of ways we’re interested in improving our benchmarks over time:

  • Increase the number of open response questions
  • Increase the number of questions in certain topical areas
  • Introduce a weighting of importance across topical areas to generate a more accurate overall score
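As an illustration of the third point, a weighted overall score could be a simple weighted mean of the topical-area scores. The weights below are hypothetical; the published benchmark is unweighted:

```python
def weighted_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of topical-area scores; weights need not sum to 1."""
    total_weight = sum(weights[area] for area in scores)
    return sum(scores[area] * weights[area] for area in scores) / total_weight

# Hypothetical importance weights -- the published benchmark is unweighted.
scores = {"1031 Exchange Process": 91.67, "Qualifying properties": 81.82,
          "History and Evolution": 64.00}
weights = {"1031 Exchange Process": 3.0, "Qualifying properties": 2.0,
           "History and Evolution": 0.5}
print(round(weighted_overall(scores, weights), 2))  # 85.57
```

Down-weighting the historical material, where scores are lowest but stakes are lower, would pull the overall score closer to the areas that matter most in practice.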

We’re also interested in assessing working professionals against our test set to establish a baseline for the benchmark that we can compare ourselves to.

Evaluations

Multiple Choice questions include four potential answers and are graded programmatically based on the letter of the answer output.
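A minimal sketch of that programmatic grading, assuming the model emits a standalone answer letter (the actual output format and parser aren’t published):

```python
import re

def grade_multiple_choice(model_output: str, answer_key: str) -> bool:
    """Pull the first standalone letter A-D from the model's output and
    compare it to the answer key. The output format is an assumption."""
    match = re.search(r"\b([A-D])\b", model_output)
    return bool(match) and match.group(1) == answer_key

print(grade_multiple_choice("The correct answer is B.", "B"))  # True
```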

Open Response questions are compared with an expected answer provided by subject matter experts and are model-graded using OpenAI’s model-graded factual evaluation template. Scores are assigned as follows:

| Outcome | Grade | Description | Score |
| --- | --- | --- | --- |
| Correct and complete | (B) | The submitted answer is a superset of the expert answer and is fully consistent with it. | 1 |
| Correct and complete | (C) | The submitted answer contains all the same details as the expert answer. | 1 |
| Correct and partially complete | (A) | The submitted answer is a subset of the expert answer and is fully consistent with it. | 0.5 |
| Incorrect | (D) | There is a disagreement between the submitted answer and the expert answer. | 0 |
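Under this rubric, a benchmark score for a set of Open Response questions is presumably the average of the per-question scores; a minimal sketch (the exact aggregation isn’t published):

```python
# Per-grade scores from the factual-evaluation rubric above.
GRADE_SCORES = {"B": 1.0, "C": 1.0, "A": 0.5, "D": 0.0}

def open_response_score(grades: list[str]) -> float:
    """Average per-question scores into a benchmark percentage."""
    return sum(GRADE_SCORES[g] for g in grades) / len(grades) * 100

print(open_response_score(["B", "C", "A", "D"]))  # 62.5
```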